2024q1 Homework2 (quiz1+2)

contributed by < ranvd >

第一周測驗 1

測驗一中的 quicksort 採用 begin 和 end 兩個堆疊的資料結構來替代遞迴所需的額外記憶體堆疊。主要是利用 begin 和 end 分別紀錄尚未處理的鏈結串列的起始位址和結束位址。

node_t *begin[max_level], *end[max_level];

每次迭代中會分別從 begin 和 end 中提取鏈結串列的起始和結束位址，並將其設定為 L 和 R，分別代表鏈結串列的最左邊和最右邊。接著，將 L 設定為 pivot，然後透過 while(p) 掃描目前正在處理的鏈結串列（即從 begin 到 end）。將大於pivot的節點放至 right 中，反之則放至 left 中。

如果在每次迭代時，從 begin 與 end 拿出的數值相同即代表該節點已經在正確的位置，因此可以將其加入已排序好的鏈結串列 result 中。

    while (i >= 0) {
        node_t *L = begin[i], *R = end[i];
        if (L != R) {
            node_t *pivot = L;
            value = pivot->value;
            node_t *p = pivot->next;
            pivot->next = NULL;
    
            while (p) {
                node_t *n = p;
                p = it->next;
                list_add(n->value > value ? &right : &left, n);
            }

            begin[i] = left;
            end[i] = list_tail(&left);
            begin[i + 1] = pivot;
            end[i + 1] = pivot;
            begin[i + 2] = right;
            end[i + 2] = list_tail(&right);

            left = right = NULL;
            i += 2;
        } else {
            if (L)
                list_add(&result, L);
            i--;
        }
    }

從程式碼中可以看出，對於每次排序需要先知道整個陣列的長度 n，並且需要而外的 2n * 2 * sizeof (node_t) 的記憶體堆疊空間。當 n 過大的時候會引發 segmentation fault。

且每次都會固定將 left、pivot 與 right 的起始與結束資訊放入 begin 與 end，即使該指標指向 NULL。這不只會造成空間的浪費，同時也會增加而外的運算次數，需要特別去判斷從 begin、end 拿出來的資訊是否為 NULL。

改進方案 (Linux 風格 inplace quicksort)

Quicksort 的 worse case 是當資料已經接近排序完成時，很容易選到資料內最大或最小的資料，在最差的情況下，這會導致複雜度提升至

O (n^{2})

。為了解決這個問題，可以在快排序完成時使用 bubble sort 或 selection sort 來提升速度，或是在選擇 pivot 時考慮從多個節點中選擇。這樣可避免選到最大或最小的資料。

在本次改進方案中並沒有針對上述 worst case 的情況進行優化，主要是想要解決過多的記憶體堆疊消耗。為了解決過多的記憶體使用，我將上述 quicksort 改成 inplace 的版本。想法是重複利用指標，利用空閒的指標去串聯已經排序好的鏈結串列。

首先，我將環狀雙向鏈結串列轉換成 linear doubly linked-list。接著，使用 prev 指標來連接已經排序好的鏈結串列。同時，這裡使用了指標的指標，讓 h 永遠指向排序好的鏈結串列，而 *ih 則指向正在排序的鏈結串列。

struct list_head *h = head->next, **ih = &h;
list_del_init(head);
h->prev = h->prev->next = NULL;

首先將鏈結串列的第一個節點設定為 pivot，以下圖為例即是 5，接著將大於 5 的節點放至 right；小於 5 的放至 left。接著將 pivot->prev = left 且 right->prev = pivot，這樣 pivot 就放在正確的位置了。接著移動 ih，走向下一個需要排序的鏈結串列，如果 right 不為 NULL，則將 *ih 移動至 right。這樣一來就可以讓 *ih 永遠指向需要排序的鏈結串列。

接著，在每次迭代時檢查該鏈結串列是否只有一個元素，如果只有一個元素則代表該節點已經在正確的位置，只需要將 ih 移動至下一個鏈結串列即可。等到 *ih 為 NULL 時就代表所有節點都在正確的位置了，因此跳離迴圈，將整個由 prev 連接的節點從 h 開始一個一個加回至 head 鏈結串列。

起始狀態。

經過一次迭代。

    while (*ih) {
        if (!(*ih)->next) {
            ih = &(*ih)->prev;
            continue;
        }

        struct list_head *pivot = (*ih);
        struct list_head *left = NULL, *right = NULL;
        struct list_head **ileft = &left, **iright = &right;
        struct list_head *it = pivot->next;
        while (it) {
            struct list_head *n = it;
            it = it->next;
            if (compare(n, pivot) > 0) {
                *iright = n;
                iright = &(*iright)->next;
            } else {
                *ileft = n;
                ileft = &(*ileft)->next;
            }
        }

        *ileft = *iright = pivot->next = NULL;
        if (left) {
            left->prev = pivot->prev;
            pivot->prev = left;
        }
        if (right) {
            right->prev = pivot;
            (*ih) = right;
        }
    }
    struct list_head *safe = h->prev;
    for (; h; h = safe, safe = safe->prev) {
        list_add_tail(h, head);
        if (safe == NULL)
            break;
    }

我們將原本的實作與改進過後的實作進行比較，當節點數量在 262000 左右時就會發生 segmentation fault，透過 GDB 觀察，確定 segmentation fault 是發生在 begin 與 end 宣告時。在尋找原因過後發現是因為執行程式時預設的 stack size 限制為 8MB，可透過 setrlimit 更改設定 stack 大小後即可變更可運行的 n 的數量。

Program received signal SIGSEGV, Segmentation fault.
0x00005555555554fd in quicksort (list=0x7fffffffdb80, 
    cmp=0x5555555552f4 <lessthan>) at link-list-non-recursive.c:73
73	    node_t *begin[2 * n], *end[2 * n];

接著我比較兩個實作在 n=10000 時的記憶體堆疊使用量。
原本的實作 (n=10000)

Number of snapshots: 65
 Detailed snapshots: [1, 5, 6, 7, 9 (peak), 23, 30, 35, 56]
--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 57     12,602,687          600,640          200,000        80,008      320,632
 58     12,733,188          600,640          200,000        80,008      320,632
 59     12,863,681          600,640          200,000        80,008      320,632
 60     12,994,231          600,640          200,000        80,008      320,632
 61     13,124,902          600,640          200,000        80,008      320,632
 62     13,255,401          281,496          201,024        80,016          456
 63     13,385,901          281,496          201,024        80,016          456
 64     13,516,401          281,496          201,024        80,016          456

改進後的實作 (n=10000)

Number of snapshots: 70
 Detailed snapshots: [14, 15, 16, 17, 18, 19, 20, 30, 32, 46, 51, 61, 67, 68 (peak)]
--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 62      3,865,190          280,544          200,000        80,008          536
 63      3,905,150          280,544          200,000        80,008          536
 64      3,945,111          280,544          200,000        80,008          536
 65      3,985,077          280,544          200,000        80,008          536
 66      4,134,402          280,464          200,000        80,008          456
 67      4,134,484          281,040          200,472        80,024          544

從上面兩個表格中可以看到，改進過後的實作在 stacks 的使用量大幅的減少，且執行時間相比原本的版本也有下降 (如下圖)。不過這與我原本預想的結果並不相同，原本預計兩個的執行速度應該要差不多，但從圖上來看，兩者的執行速度有顯著差距，目前還沒有理解造成的原因。

Image Not Showing Possible Reasons

The image was uploaded to a note which you don't have access to
The note which the image was originally uploaded to has been deleted

Learn More →

第一周測驗 2

待補

第二周測驗 1

由於 preorder 會直接輸出走訪的節點而 inorder 則會由小到大輸出節點。因此每從 preorder 中拿出一個節點後，如果可以知道對應到 inorder 的位置時就可以知道在目前走訪的 preorder 節點中左右子樹分別還有哪些節點。

由於一般鏈結串列查找節點在 inorder 的位置為

O (n)

複雜度，因此可以考慮使用 hash，使得查詢的複雜度降低至

O (1)

。下列程式碼即是將節點放入 hash table 的方式，首先會要求一塊記憶體，接著將要賦予的 val 透過簡單的 hash function 得到 index 後，使用得到的 index 指定 hash table 的位置，將整個節點加入 hash table 中正確的位置。

static inline void node_add(int val,
                            int idx,
                            int size,
                            struct hlist_head *heads)
{
    struct order_node *on = malloc(sizeof(*on));
    on->val = val;
    on->idx = idx;
    int hash = (val < 0 ? -val : val) % size;
    hlist_add_head(&on->node, &heads[hash]/*DDDD*/);
}

首先我們要定出走訪 preorder 與 inorder 走訪的下界與上界，在這個區間內走訪 preorder 的元素時需要找到元素對應在 inorder 的位置，接著在這上下界內的 inorder 左邊的元素必定在左子樹；反之，在右邊必定在右子樹。考慮以下範例

當第一次走訪的時候走訪至 7，在找到 7 對應到 inorder 的位置後可以將 inorder 切成兩半，左邊是 3, 5, 6 右邊則是 9, 10。左右子樹的 preorder 則可以透過 inorder 在左邊與右邊陣列的大小得知應該要再走訪幾個元素，因此以這個例子來說就是左子樹的 preorder 就是 5, 3, 6；右子樹的 preorder 就是 10, 9。

以下為左右子樹的 inorder 與 preorder，透過這種方式不斷遞迴下去就可以正確的重建樹狀結構。

以下程式碼則是將上述的想法進行實作，不過實作對於左右子樹的 inorder, preorder 是使用 pre_low, pre_high 等變數限制走訪的上下界。

static struct TreeNode *dfs(int *preorder,
                            int pre_low,
                            int pre_high,
                            int *inorder,
                            int in_low,
                            int in_high,
                            struct hlist_head *in_heads,
                            int size)
{
    if (in_low > in_high || pre_low > pre_high)
        return NULL;

    struct TreeNode *tn = malloc(sizeof(*tn));
    tn->val = preorder[pre_low];
    int idx = find(preorder[pre_low], size, in_heads);
    tn->left = dfs(preorder, pre_low + 1, pre_low + (idx - in_low), inorder,
                   in_low, idx - 1, in_heads, size);
    tn->right = dfs(preorder, pre_high - (in_high - idx - 1), pre_high, inorder,
                    idx + 1, in_high, in_heads, size);
    return tn;
}

改進方案

可以修改雜湊函數降低碰撞次數，從下面的程式碼中可以看到此雜湊函數非常的簡單，而且一眼就可以看出當

| val |

相同時必定會發生碰撞。如果能降低碰撞次數，可以減少 node_add 與 find 這兩個函數的執行時間，讓 hash 盡量接近理論的

O (1)

複雜度。

static inline void node_add(int val,
                            int idx,
                            int size,
                            struct hlist_head *heads)
{
    struct order_node *on = malloc(sizeof(*on));
    on->val = val;
    on->idx = idx;
    int hash = (val < 0 ? -val : val) % size;
    hlist_add_head(&on->node, &heads[hash]/*DDDD*/);
}

static int find(int num, int size, const struct hlist_head *heads)
{
    struct hlist_node *p;
    int hash = (num < 0 ? -num : num) % size;
    ...略
}

因此可以使用 Multiplication method 作為雜湊的方式，參考 Linux 核心的 hash table 實作提及在 linux/hash.h 的實作 (如下程式碼)。我們可以將 Multiplication method 中乘以的常數用 GOLDEN_RATIO_32 來代換，以此降低雜湊時的碰撞。

#define GOLDEN_RATIO_32 0x61C88647

#ifndef HAVE_ARCH__HASH_32
#define __hash_32 __hash_32_generic
#endif
static inline u32 __hash_32_generic(u32 val)
{
    return val * GOLDEN_RATIO_32;
}

static inline u32 hash_32(u32 val, unsigned int bits)
{
    /* High bits are more random, so use them. */
    return __hash_32(val) >> (32 - bits);
}

Linux 核心的 cgroups 相關原始程式碼

在 include/linux/cgroup.h 中可以看到以下程式碼。不過這段程式碼牽涉到同步的觀念，很多程式碼與註解看不懂，以下就只能稍微的解釋程式碼的運作流程。

下面會以 css 為第一個節點進行走訪，透過 css_next_descendant_pre 取出下一個節點。

#define css_for_each_descendant_pre(pos, css)				\
	for ((pos) = css_next_descendant_pre(NULL, (css)); (pos);	\
	     (pos) = css_next_descendant_pre((pos), (css)))

在展開 css_next_descendant_pre 的程式碼前需要先提到 cgroup_subsys_state 結構體。cgroup_subsys_state 裡面包含了兩個鍊結串列 sibling 與 children，功能就是紀錄自己的子節點與兄弟節點。結構體中還包含了 serial_nr，用來比較兄弟節點間的順序和 flags 用來紀錄結構體的狀態。同時也用了一個指標 parent 指向父親節點。

struct cgroup_subsys_state {
    ...略
	/* siblings list anchored at the parent's ->children */
	struct list_head sibling;
	struct list_head children;

    ...略
	unsigned int flags;

	/*
	 * Monotonically increasing unique serial number which defines a
	 * uniform order among all csses.  It's guaranteed that all
	 * ->children lists are in the ascending order of ->serial_nr and
	 * used to allow interrupting and resuming iterations.
	 */
	u64 serial_nr;

	/*
	 * Incremented by online self and children.  Used to guarantee that
	 * parents are not offlined before their children.
	 */
	atomic_t online_cnt;

    ...略
	struct cgroup_subsys_state *parent;
};

從 css_next_descendant_pre 中的程式碼可以看到，當 pos == NULL 代表第一次呼叫。因此直接回傳 root 作為第一次走訪的點。在接下來的呼叫中，如果 pos 底下還有 child 的話就會直接使用 css_next_child(NULL, pos) 得到下一個節點。如果沒有的話就會進入到 while 進行迭代。在迭代的過程中透過 css_next_child(pos, pos->parent) 尋找下一個節點的位置。如果找不到代表應該要往上一層進行尋找。

struct cgroup_subsys_state *
css_next_descendant_pre(struct cgroup_subsys_state *pos,
			struct cgroup_subsys_state *root)
{
	struct cgroup_subsys_state *next;

	cgroup_assert_mutex_or_rcu_locked();

	/* if first iteration, visit @root */
	if (!pos)
		return root;

	/* visit the first child if exists */
	next = css_next_child(NULL, pos);
	if (next)
		return next;

	/* no child, visit my or the closest ancestor's next sibling */
	while (pos != root) {
		next = css_next_child(pos, pos->parent);
		if (next)
			return next;
		pos = pos->parent;
	}

	return NULL;
}

css_next_child 可以找出 parent 的子節點在 pos 之後的下一個節點。如果 pos == NULL 則從 parent->children 中拿出第一個節點。如果 pos 不為 NULL，則從 pos 的 sibling 的下一個節點拿出資料。根據他的註解說明如果 pos.flag 有 CSS_RELEASED 的話 pos 的下一個節點就不能 dereferenced，必須從 parent 走訪，因此下面才會使用 list_for_each_entry_rcu 來走訪。

struct cgroup_subsys_state *css_next_child(struct cgroup_subsys_state *pos,
					   struct cgroup_subsys_state *parent)
{
	struct cgroup_subsys_state *next;

	cgroup_assert_mutex_or_rcu_locked();

	/*
	 * @pos could already have been unlinked from the sibling list.
	 * Once a cgroup is removed, its ->sibling.next is no longer
	 * updated when its next sibling changes.  CSS_RELEASED is set when
	 * @pos is taken off list, at which time its next pointer is valid,
	 * and, as releases are serialized, the one pointed to by the next
	 * pointer is guaranteed to not have started release yet.  This
	 * implies that if we observe !CSS_RELEASED on @pos in this RCU
	 * critical section, the one pointed to by its next pointer is
	 * guaranteed to not have finished its RCU grace period even if we
	 * have dropped rcu_read_lock() in-between iterations.
	 *
	 * If @pos has CSS_RELEASED set, its next pointer can't be
	 * dereferenced; however, as each css is given a monotonically
	 * increasing unique serial number and always appended to the
	 * sibling list, the next one can be found by walking the parent's
	 * children until the first css with higher serial number than
	 * @pos's.  While this path can be slower, it happens iff iteration
	 * races against release and the race window is very small.
	 */
	if (!pos) {
		next = list_entry_rcu(parent->children.next, struct cgroup_subsys_state, sibling);
	} else if (likely(!(pos->flags & CSS_RELEASED))) {
		next = list_entry_rcu(pos->sibling.next, struct cgroup_subsys_state, sibling);
	} else {
		list_for_each_entry_rcu(next, &parent->children, sibling,
					lockdep_is_held(&cgroup_mutex))
			if (next->serial_nr > pos->serial_nr)
				break;
	}

	/*
	 * @next, if not pointing to the head, can be dereferenced and is
	 * the next sibling.
	 */
	if (&next->sibling != &parent->children)
		return next;
	return NULL;
}

第二周測驗 2

在測驗中的實作結合了 linux 風格的鏈結串列與雜湊表來實作 LRU cache，在 LRUCache 結構體中定義了雜湊表 (hhead) 用來做常數時間的資料存取，並透過環狀鏈結串列 (dhead) 紀錄資料使用的形況，以 LRU 進行排序。

typedef struct {
    int capacity;
    int count;
    struct list_head dhead;
    struct hlist_head hhead[];
} LRUCache;

typedef struct {
    int key;
    int value;
    struct hlist_node node;
    struct list_head link;
} LRUNode;

使用 lRUCache 取得快取內的資料。會先將輸入的 key 經過雜湊函數得到資料的位置後，使用 hlist_for_each 進行迭代，過程中尋找是否有相符的資料，如果有的話就將資料在鏈結串列中的位置移動至開頭；如果沒有的話則回傳 -1。這可以使得 dhead 的順序永遠符合 LRU 的規定。

int lRUCacheGet(LRUCache *obj, int key)
{
    int hash = key % obj->capacity;
    struct hlist_node *pos;
    hlist_for_each (pos, &obj->hhead[hash]) {
        LRUNode *cache = list_entry(pos, LRUNode, node /*HHHH*/);
        if (cache->key == key) {
            list_move(&cache->link /*IIII*/, &obj->dhead);
            return cache->value;
        }
    }
    return -1;
}

使用 lRUCachePut 來新增資料，在一開始會先檢查快取內是否有已經有資料，如果有的話就更新其數值並將該節點移至鏈結串列 (dhead) 的開頭；反之，如果沒有的話就會檢查是否超過快取的上限，如果超過的話就從 dhead 尾巴移除資料，如果沒有超過則將節點新增至 hhead 和 dhead 的中。

void lRUCachePut(LRUCache *obj, int key, int value)
{
    LRUNode *cache = NULL;
    int hash = key % obj->capacity;
    struct hlist_node *pos;
    hlist_for_each (pos, &obj->hhead[hash]) {
        LRUNode *c = list_entry(pos, LRUNode, node /*JJJJ*/);
        if (c->key == key) {
            list_move(&c->link /*KKKK*/, &obj->dhead);
            cache = c;
        }
    }

    if (!cache) {
        if (obj->count == obj->capacity) {
            cache = list_last_entry(&obj->dhead, LRUNode, link);
            list_move(&cache->link, &obj->dhead);
            hlist_del(&cache->node);
            hlist_add_head(&cache->node, &obj->hhead[hash]);
        } else {
            cache = malloc(sizeof(LRUNode));
            hlist_add_head(&cache->node, &obj->hhead[hash]);
            list_add(&cache->link, &obj->dhead);
            obj->count++;
        }
        cache->key = key;
    }
    cache->value = value;
}

改進方案

目前只想到與第二周測驗 1 相同的修改方式，即更改 hash function 來降低 miss rate。

2024q1 Homework2 (quiz1+2)

第一周測驗 1

改進方案 (Linux 風格 inplace quicksort)

第一周測驗 2

第二周測驗 1

改進方案

Linux 核心的 cgroups 相關原始程式碼

第二周測驗 2

改進方案

Linux 核心中 LRU 相關程式碼

第二周測驗 3

Linux 核心 find_nth_bit 的應用案例

2024q1 Homework2 (quiz1+2)

第一周測驗 1

改進方案 (Linux 風格 inplace quicksort)

第一周測驗 2

第二周測驗 1

改進方案

Linux 核心的 cgroups 相關原始程式碼

第二周測驗 2

改進方案

Linux 核心中 LRU 相關程式碼

第二周測驗 3

Linux 核心 find_nth_bit 的應用案例

Read more

NVIDIA Driver 與 cuda 安裝

2024q1 Homework5 (assessment)

2024q1 Homework6 (integration)

2024q1 Homework4 (quiz3+4)