2022q1 Homework1 (lab0)

contributed by < arthurchang09 >

開發環境

目前是在 VirtualBox 上開發

$ uname -r
5.13.0-28-generic

$ gcc --version
gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   39 bits physical, 48 bits virtual
CPU(s):                          1
On-line CPU(s) list:             0
Thread(s) per core:              1
Core(s) per socket:              1
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           165
Model name:                      Intel(R) Core(TM) i7-10700K CPU @ 3.80GHz
Stepping:                        5
CPU MHz:                         3792.002
BogoMIPS:                        7584.00
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       32 KiB
L1i cache:                       32 KiB
L2 cache:                        256 KiB
L3 cache:                        16 MiB
NUMA node0 CPU(s):               0

環境設置

安裝必要的開發工具套件

$ sudo apt install build-essential git-core valgrind
$ sudo apt install cppcheck clang-format aspell colordiff

中英文之間用一個半型空白區隔

檢查 cppcheck 版本

$ cppcheck --version
Cppcheck 1.90

過舊,移除並嘗試安裝最新版本

$ sudo snap install cppcheck
error: snap "cppcheck" not found

因此要自己編譯了,參考 2020q1 Homework1 (lab0) - IepIweidieng
先移除 cppcheck

$ sudo apt remove cppcheck

取得原始碼後執行

$ mkdir build/
$ cd build/
$ cmake .. -DHAVE_RULES=ON -DUSE_MATCHCOMPILER=ON
$ cmake --build .

執行 cmake .. -DHAVE_RULES=ON -DUSE_MATCHCOMPILER=ON 發生錯誤

CMake Error at cmake/findDependencies.cmake:16 (message):
  pcre dependency for RULES has not been found

cmake/findDependencies 找到是缺少 pcre.h 之類的library

if (HAVE_RULES)
    find_path(PCRE_INCLUDE pcre.h)
    find_library(PCRE_LIBRARY pcre)
    if (NOT PCRE_LIBRARY OR NOT PCRE_INCLUDE)
        message(FATAL_ERROR "pcre dependency for RULES has not been found")
    endif()
endif()

安裝 libpcre3-dev

$ sudo apt-get install libpcre3-dev

完成 cmake .. -DHAVE_RULES=ON -DUSE_MATCHCOMPILER=ON

-- Using bundled version of tinyxml2
-- ------------------ General configuration for Cppcheck 2.7 -----------------
-- 
-- CMake Generator =       Unix Makefiles
-- Compiler =              GNU
-- Compiler Version =      9.3.0
-- Build type =            Debug
-- CMAKE_INSTALL_PREFIX =  /usr/local
-- CMAKE_DISABLE_PRECOMPILE_HEADERS = Off
-- C++ flags (General) =   
-- C++ flags (Release) =   -O3 -DNDEBUG
-- C++ flags (RelWithDebInfo) = -O2 -g -DNDEBUG
-- C++ flags (Debug) =     -g
-- Found Define: _GLIBCXX_DEBUG
-- Found Define: HAVE_RULES
-- Found Define: TIXML_USE_STL
-- Found Define: FILESDIR="/usr/local/share/Cppcheck"
-- 
-- ---------------------------------------------------------
-- ANALYZE_MEMORY =        OFF
-- ANALYZE_ADDRESS =       OFF
-- ANALYZE_THREAD =        OFF
-- ANALYZE_UNDEFINED =     OFF
-- ANALYZE_DATAFLOW =      OFF
-- WARNINGS_ARE_ERRORS =   OFF
-- 
-- USE_MATCHCOMPILER =     ON
-- USE_MATCHCOMPILER_OPT = ON
-- 
-- BUILD_SHARED_LIBS =     OFF
-- LIBXML2_XMLLINT_EXECUTABLE = LIBXML2_XMLLINT_EXECUTABLE-NOTFOUND
-- BUILD_TESTS =           OFF
-- ENABLE_CHECK_INTERNAL = OFF
-- ENABLE_OSS_FUZZ =       ON
-- 
-- BUILD_GUI =             OFF
-- WITH_QCHART =           OFF
-- 
-- HAVE_RULES =            ON
-- PCRE_LIBRARY =          /usr/lib/x86_64-linux-gnu/libpcre.so
-- 
-- PYTHON_VERSION_STRING = 3.8.2
-- PYTHON_EXECUTABLE =     /usr/bin/python3
-- 
-- USE_Z3 =                OFF
-- USE_BUNDLED_TINYXML2 =  ON
-- 
-- NPROC=1
-- RUN_CLANG_TIDY=RUN_CLANG_TIDY-NOTFOUND
-- Configuring done
-- Generating done
-- Build files have been written to: /home/hi/cppcheck/build

編譯完成,得以下輸出

Scanning dependencies of target cppcheck
[100%] Building CXX object cli/CMakeFiles/cppcheck.dir/main.cpp.o
[100%] Linking CXX executable ../bin/cppcheck
[100%] Built target cppcheck

接著執行安裝,首先安裝 checkinstall

$ sudo apt install checkinstall

執行以下命令

$ sudo checkinstall

安裝完成出現

**********************************************************************

 Done. The new package has been installed and saved to

 /home/hi/cppcheck/build/build_20220216-1_amd64.deb

 You can remove it from your system anytime using: 

      dpkg -r build

**********************************************************************

最後再次確認版本

$ cppcheck --version
Cppcheck 2.7

queue.c 的實作

q_new()

struct list_head *q_new()
{
    struct list_head *q = malloc(sizeof(struct list_head));
    if (!q) {
        return NULL;
    }
    q->prev = q;
    q->next = q;
    return q;
}

使用 malloc 取得記憶體空間。若未取的空間,回傳 NULL ,若成功,根據 list.h 的說明,將 nextprev 指回自己本身。

The @prev pointer of the list head points to the last list node of the list and @next points to the first list node of the list. For an empty list, both member variables point to the head.

q_free()

void q_free(struct list_head *l)
{
    if (!l) {
        return;
    }
    l->prev->next = NULL;
    while (l->next) {
        // cppcheck-suppress nullPointer
        element_t *rm = list_entry(l->next, element_t, list);
        l->next = l->next->next;
        q_release_element(rm);
    }
    free(l);
}

檢查 l 是否為 NULL 。若為否,使用 while 迴圈走訪整個 linked-list , 使用 list_enrty 取得整個 node 位址,呼叫 q_release_element 釋放記憶資源,最後釋放 l 的記憶體資源。

q_insert_head()

bool q_insert_head(struct list_head *head, char *s)
{
    if (!head) {
        return false;
    }
    size_t s_len = strlen(s) + 1;
    element_t *new_ele_node = NULL;
    if (!(new_ele_node = malloc(sizeof(element_t)))) {
        return false;
    }
    if (!(new_ele_node->value = (char *) malloc(s_len))) {
        free(new_ele_node);
        return false;
    }
    memcpy(new_ele_node->value, s, s_len);
    struct list_head *node_list = &new_ele_node->list;
    new_ele_node->list.prev = head;
    new_ele_node->list.next = head->next;
    head->next->prev = node_list;
    head->next = node_list;
    return true;
}

檢查完 headNULL 。 使用 strlen 取得字串長度,並使用 malloc 取得 element_t 以及其 value 之記憶體空間。使用 memcpy 將字串複製到此空間。再將此 node 串在整個 linked-list 的開頭。

q_insert_tail()

bool q_insert_tail(struct list_head *head, char *s)
{
    if (!head) {
        return false;
    }
    size_t s_len = strlen(s) + 1;
    element_t *new_ele_node = NULL;
    if (!(new_ele_node = malloc(sizeof(element_t)))) {
        return false;
    }
    if (!(new_ele_node->value = (char *) malloc(s_len))) {
        free(new_ele_node);
        return false;
    }
    memcpy(new_ele_node->value, s, s_len);
    struct list_head *node_list = &new_ele_node->list;
    new_ele_node->list.prev = head->prev;
    new_ele_node->list.next = head;
    head->prev->next = node_list;
    head->prev = node_list;

    return true;
}

實作方式和 q_insert_head() 類似,不同之處在將節點插入的方式。將相關節點對應的指標指向 head 及 linked list 尾端的節點,完成插入尾端之操作。

總是寫作為 "linked list",中間不用有連字號

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →
jserv

q_remove_head()

element_t *q_remove_head(struct list_head *head, char *sp, size_t bufsize)
{
    if (!head || !head->next || head->next == head) {
        return NULL;
    }
    // cppcheck-suppress nullPointer
    element_t *rm = list_entry(head->next, element_t, list);
    if (sp) {
        strncpy(sp, rm->value, bufsize - 1);
        sp[bufsize - 1] = '\0';
    }
    rm->list.next->prev = head;
    head->next = head->next->next;

    return rm;
}

使用 list_entry 取第一個節點的指標。使用 strncp 將字串複製到指定的位址,並強制在最尾端寫入 \0 ,避免 sp 所指向之空間較小,使其他程式讀取時發稱錯誤。將相關節點對應的指標指向 head 及 linked list 尾端,完成刪除開頭之操作。

q_remove_tail()

element_t *q_remove_tail(struct list_head *head, char *sp, size_t bufsize)
{
    if (!head || !head->next || head->next == head) {
        return NULL;
    }
    // cppcheck-suppress nullPointer
    element_t *rm = list_entry(head->prev, element_t, list);
    if (sp) {
        memcpy(sp, rm->value, bufsize);
        sp[bufsize - 1] = '\0';
    }
    rm->list.prev->next = head;
    head->prev = head->prev->prev;
    return rm;
}

實作方式與 q_remove_head() 相似,不同之處在將節點刪除的方式。將相關節點對應的指標指向 head 及 linked list 尾端的節點,完成刪除尾端之操作。

q_size()

int q_size(struct list_head *head)
{
    int lengh = 0;
    if (!head) {
        return 0;
    }
    struct list_head *curr = head->next;
    while (curr != head) {
        lengh++;
        curr = curr->next;
    }
    return lengh;
}

使用 curr 走訪整個 link list ,每經過一個節點將記錄長度的 lengh 加一。最終取得整個 link list 長度。

q_delete_mid()

bool q_delete_mid(struct list_head *head)
{
    // https://leetcode.com/problems/delete-the-middle-node-of-a-linked-list/
    if (!head || q_size(head) == 0) {
        return false;
    }
    struct list_head **indir = &head;
    for (struct list_head *fast = head->next;
         fast != head && fast->next != head; fast = fast->next->next) {
        indir = &(*indir)->next;
    }
    struct list_head *del = *indir;
    del = del->next;
    del->prev->next = del->next;
    del->next->prev = del->prev;
    // cppcheck-suppress nullPointer
    element_t *del_n = list_entry(del, element_t, list);
    q_release_element(del_n);
    return true;
}

參考 你所不知道的 C 語言: linked list 和非連續記憶體 中的實作。使用快慢指標取得中間節點的位址,並且將其前後的節點對應的指標相連。最後使用 list_entry 取得整的節點的位址, 呼叫 q_release_element() 釋放記憶體空間。

一開始快慢指標的狀態如下圖。







G



e1

head

prev

next



e2

gerbil

prev

next



e1:e->e2:w






e3

bear

prev

next



e2:e->e3:w






e4

dolphin

prev

next



e3:e->e4:w






e4:s->e1:s






p1
fast



p1->e2





p2
*indir



p2->e1





fast 達到終止條件時,如下圖。較慢的指標 *indir 指向的下一個節點才是正中間的節點,因此迴圈結束時要在移動到下一個節點。







G



e1

head

prev

next



e2

gerbil

prev

next



e1:e->e2:w






e3

bear

prev

next



e2:e->e3:w






e4

dolphin

prev

next



e3:e->e4:w






e4:s->e1:s






p1
fast



p1->e4





p2
*indir



p2->e2





ToDo :
一開始條件判斷的 q_size 會走訪一次 linked list,可能減少效能,需要替代方式。

q_delete_dup()

bool q_delete_dup(struct list_head *head)
{
    // https://leetcode.com/problems/remove-duplicates-from-sorted-list-ii/
    if (!head) {
        return false;
    }
    struct list_head *node, *next;
    bool next_same_val = false;
    for (node = (head)->next; node != (head);) {
        next = node->next;
        // cppcheck-suppress nullPointer
        element_t *del = list_entry(node, element_t, list);
        if (next != head &&
            strcmp(del->value,
                   // cppcheck-suppress nullPointer
                   list_entry(next, element_t, list)->value) == 0) {
            next_same_val = true;
            node->prev->next = next;
            next->prev = node->prev;
            q_release_element(del);
        } else if (next_same_val) {
            next_same_val = false;
            node->prev->next = next;
            next->prev = node->prev;
            q_release_element(del);
        }
        node = next;
    }
    return true;
}

使用 for 迴圈走訪整個 linked list。 nodenext 分別指向現在走訪到的節點以及下一個要走訪的節點。 確認 next 不是指向 head ,使用 list_entry 找到節點對應的指標並比較其 value 的值。若相同,則刪除現在的節點,即 node 對應的節點,且將 next_same_value 設成 true,表示下一個節點也需要刪除。當 node 到了 linked list 最後一個節點且此點和上一節點值相同,需被刪除時,由於 next 指向 head,不會觸發前述的刪除機制,因此需要依靠 next_same_value 來確認是否需要刪除,並完成相應的刪除動作。

q_swap()

void q_swap(struct list_head *head)
{
    // https://leetcode.com/problems/swap-nodes-in-pairs/
    struct list_head *node;
    for (node = (head)->next; node != (head) && node->next != head;
         node = node->next) {
        struct list_head *tmp_head = node->next;
        list_del(node);
        list_add(node, tmp_head);
    }
}

使用 for 迴圈 走訪整個 linked list,其中 node 指向現在走訪到的節點,並用 tmp_head 指向下一個節點。如下圖。







G



e1

head

prev

next



e2

gerbil

prev

next



e1:e->e2:w






e3

bear

prev

next



e2:e->e3:w






e4

dolphin

prev

next



e3:e->e4:w






e4:s->e1:s






p1
node



p1->e2





p2
tmp_head



p2->e3





使用 list_delnode 指向的節點移出 linked list ,並連接好前後的節點。如下圖。







G



e1

head

prev

next



e3

bear

prev

next



e1:e->e3:ew






e2

gerbil

prev

next



e2:s->e1





e2:e->e3:w





e4

dolphin

prev

next



e3:e->e4:w






e4:s->e1:s






p1
node



p1->e2





p2
tmp_head



p2->e3





使用list_add 將移出的節點重新插入 linked list ,並以 tmp_head 作為 list_add 所需的 head 插入。因此, node 會插入在 tmp_head 的後方。 完成 swap 的動作。







G



e1

head

prev

next



e3

bear

prev

next



e1:e->e3:w






e2

gerbil

prev

next



e4

dolphin

prev

next



e2:e->e4:w






e3:e->e2:w






e4:s->e1:s






p1
node



p1->e2





p2
tmp_head



p2->e3





q_reverse()

void q_reverse(struct list_head *head)
{
    if (!head || !head->next || head->next->next == head) {
        return;
    }
    struct list_head *curr = head->next;
    struct list_head *temp = NULL;
    while (curr != head) {
        temp = curr->next;
        curr->next = curr->prev;
        curr->prev = temp;
        curr = temp;
    }
    temp = head->next;
    head->next = head->prev;
    head->prev = temp;
}

使用 while 迴圈走訪每個節點。用 temp 暫存下一個節點的位址,並且將 prevnext 做交換,接著走訪下一個節點。最後將 headprevnext做交換。如此即完成反轉的操作。

q_sort()

Merge sort 的部分額外使用兩個 function 實作。 merge() 負責將兩個 linked list 合併在一起。 merge_sort_list 將 linked list 拆分並且呼叫 merge 合併拆分的 linked list。

struct list_head *merge(struct list_head *l1,
                        struct list_head *l2,
                        struct list_head *head)
{
    if (!l1 || l1 == head) {
        return l2;
    }
    if (!l2 || l2 == head) {
        return l1;
    }
    struct list_head *tmp_head, *curr;
    // cppcheck-suppress nullPointer
    element_t *l1_node = list_entry(l1, element_t, list);
    // cppcheck-suppress nullPointer
    element_t *l2_node = list_entry(l2, element_t, list);
    if (strcmp(l1_node->value, l2_node->value) < 0) {
        tmp_head = l1;
        l1->prev = head;
        l1 = l1->next;
    } else {
        tmp_head = l2;
        l2->prev = head;
        l2 = l2->next;
    }
    curr = tmp_head;
    while (l1 != head && l2 != head) {
        // cppcheck-suppress nullPointer
        l1_node = list_entry(l1, element_t, list);
        // cppcheck-suppress nullPointer
        l2_node = list_entry(l2, element_t, list);
        if (strcmp(l1_node->value, l2_node->value) < 0) {
            curr->next = l1;
            l1->prev = curr;
            l1 = l1->next;
        } else {
            curr->next = l2;
            l2->prev = curr;
            l2 = l2->next;
        }
        curr = curr->next;
    }
    if (l1 == head) {
        curr->next = l2;
        l2->prev = curr;
    } else {
        curr->next = l1;
        l1->prev = curr;
    }
    return tmp_head;
}

以上為 merge() 的程式。原本採用遞迴的方式實作。但是進行到約 26 萬個 node 的排序時會出現 segmentation fault ,若 node 數量小於此則能正常運作。推測是遞迴太多次造成堆疊溢出。因為除了這一段程式碼採用遞迴外,拆分 linked list 的部分也採用遞迴。最後將 merge 改成以迭代實作。

struct list_head *merge_sort_list(struct list_head *head,
                                  struct list_head *real_head)
{
    if (!head || !head->next || head->next == real_head || head == real_head) {
        return head;
    }

    struct list_head *fast = head;
    struct list_head *slow = head;
    while (fast != real_head && fast->next != real_head) {
        fast = fast->next->next;
        slow = slow->next;
    }

    fast = slow;
    slow = slow->prev;
    slow->next = real_head;
    fast->prev = real_head;

    struct list_head *l1 = merge_sort_list(head, real_head);
    struct list_head *l2 = merge_sort_list(fast, real_head);
    return merge(l1, l2, real_head);
}

採用快慢指標找出正中間的節點以便拆分。由於一開傳入的 head 為第一個節點的位址而非整個 linked list 的 head ,因此選取中間節點時不須像 q_delete_mid 要取 slow 指到的下一個節點。

下圖為一開始尋找快慢節點的狀態。







G



e1

head

prev

next



e2

gerbil

prev

next



e1:e->e2:w






e3

bear

prev

next



e2:e->e3:w






e4

dolphin

prev

next



e3:e->e4:w






e4:s->e1:s






p1
fast



p1->e2





p2
slow



p2->e2





當迴圈結束時,得以下狀態, slow 指向正中間的節點,不需要再移動一次。







G



e1

head

prev

next



e2

gerbil

prev

next



e1:e->e2:w






e3

bear

prev

next



e2:e->e3:w






e4

dolphin

prev

next



e3:e->e4:w






e4:s->e1:s






p1
fast



p1->e4





p2
slow



p2->e3





接著要將 linked list 拆成兩半。先將 fast 改指向 slow 指向的節點,再分別以 headfast 指向這兩個的第一個節點,並將 head 此 list 尾端的 next 指向原本整個 linked list 的 head,即所傳入的 real_head ,同理 fast 指向節點之 prev 也要指向 real_head。拆分結果如下圖。







G



e1

head

prev

next



e2

gerbil

prev

next



e1:e->e2:w






e2:s->e1:s





e3

bear

prev

next



e3:s->e1:s





e4

dolphin

prev

next



e3:e->e4:w






e4:e->e1:w






p1
head



p1->e2





p2
fast



p2->e3





拆完後再對這兩個 linked list 進行拆分,之後呼叫 merge() 將拆分的兩個 linked list 合併。最後將合併好的開頭指標回傳。

void q_sort(struct list_head *head)
{
    if (!head || !head->next) {
        return;
    }
    head->next = merge_sort_list(head->next, head);
    struct list_head *tail = head;
    while (tail->next != head) {
        tail = tail->next;
    }
    head->prev = tail;
}

以上的 function 呼叫 merge_sort_list() 進行 merge sort. 將 head->next 傳入此function的目的是讓每個 linked list 結構一樣,不需要為拆分出來的 link list 額外增加一個 head 。 完成 merge sort 後使用迴圈找到最後一個節點的位址,將 head->prev 指向此節點。

引入 lib/list_sort.c

研讀 lib/list_sort.c

Latest commit 9dbbc3b on 8 Jul 2021

list_sort 嘗試達到

2:1 balanced merge 。 假設有兩個
2k
長度的 linked list,若後面有長度
2k
個元素會先合併,使長度比維持
2:1
。若 cache 能容納
3×2k
個元素,則這個方法能夠避免 cache thrashing 。 減少
0.2×n
次比較。

參數

__attribute__((nonnull(2,3)))
void list_sort(void *priv, struct list_head *head, list_cmp_func_t cmp)

首先,一開始的 __attribute__((nonnull(2,3))) ,在 6.33 Declaring Attributes of Functions 提及可幫助編譯器確認 function 的屬性,可幫助優化或檢查正確性。 在 6.33.1 Common Function Attributes 提到 nonenull 用來確認 function 參數中所指定的 pointer 不是 NULL。 以上方的程式碼為例, struct list_head *headlist_cmp_func_t cmp 皆不能為 NULL

  • priv : private data,對 list_sort 無用,會傳給 cmp
  • head : list 的開頭
  • cmp : 比較參數元素的大小,為 function pointer。 其宣告在 list_sort.h , 如下:
    ​​​​typedef int __attribute__((nonnull(2,3))) (*list_cmp_func_t)(void *,
    ​​​​        const struct list_head *, const struct list_head *);
    
    void * 是前面提及的 priv ,兩個 const struct list_head * 代表兩個要比較的 list_head,若前者 @a 要排在後者 @b 後面 (@a>@b),則回傳 >0 ,反之則回傳 <0

接下來看 list_sort 的程式

    struct list_head *list = head->next, *pending = NULL;
    size_t count = 0;	/* Count of pending */

    if (list == head->prev)	/* Zero or one elements */
        return;

    /* Convert to a null-terminated singly-linked list. */
    head->prev->next = NULL;

*list 指向第一個節點,而 *pending 先指向 NULL*pending 是一個 linked list 用來存放每一個 sublist 的第一個元素,其功用是存放每一個 sublist ,以便於完成

2:1 merge sort 。
首先檢查 element 個數是否超過兩個。接著將 linked list 最後一個節點之 next 指向 NULL ,使其成為 null-terminated singly-linked list 。此時 prev 依然連結著,只是後面另有用途。

merge list 的部分

do {
		size_t bits;
		struct list_head **tail = &pending;

		/* Find the least-significant clear bit in count */
		for (bits = count; bits & 1; bits >>= 1)
			tail = &(*tail)->prev;
		/* Do the indicated merge */
		if (likely(bits)) {
			struct list_head *a = *tail, *b = a->prev;

			a = merge(priv, cmp, b, a);
			/* Install the merged result in place of the inputs */
			a->prev = b->prev;
			*tail = a;
		}

		/* Move one element from input list to pending */
		list->prev = pending;
		pending = list;
		list = list->next;
		pending->next = NULL;
		count++;
	} while (list);

上方程式處理

2:1 merge 的部分。bits 是用來決定合併 sublist的時機,會檢查是否湊滿
2k
個 node。有的話,if (likely(bits)) 會成立,呼叫 merge 進行合併,其中 likely() 用於幫助編譯器優化,表示較有可能發生。 merge 好的 sublist 會是 singular linked list,每個 sublist 的開頭會用 prev 互相連在一起,形成 pending 所需要的樣式。 這裡巧妙運用 prev 使 pending list 不需要額外的空間和大量的時間串接 sublist。

for (bits = count; bits & 1; bits >>= 1) 會將 bit 持續右移,同時將 *tail 在sublist 中向 prev 方向移動,尋找可能可以 merge 的 sublist 。bit 右移後的結果會決定是否要進行 merge 。

下面以實際例子觀察 bit 的運作模式

初始狀態

count=0 ,進入迴圈前







G



head

head

prev

next



e1

1

prev

next



head:e->e1:w






e2

2

prev

next



e1:e->e2:w






e3

3

prev

next



e2:e->e3:w






e4

4

prev

next



e3:e->e4:w






null

NULL



e4:s->null





p1
list



p1->e1





p2
pending



p2->null





由於

count=0 ,不會進入 for (bits = count; bits & 1; bits >>= 1), 也不會進行 merge。因此次 for 迴圈執行完會變成下圖。
pending list : null <- 1 <- pending







G


cluster_0

sublist 1



head

head

prev

next



e1

1

prev

next



head:e->e1:w





null

NULL



e1:s->null





e1:s->null





e2

2

prev

next



e2:s->e1:wn





e3

3

prev

next



e2:e->e3:w






e4

4

prev

next



e3:e->e4:w






e4:s->null





p1
list



p1->e2





p2
pending



p2->e1





接著

count=1=012,此時會進入 for (bits = count; bits & 1; bits >>= 1) , 結束後
bits=002
, 由於
bits=0
, 也不會merge。
pending list : null <- 2 <- 1 <- pending







G


cluster_0

sublist 1


cluster_1

sublist 2



head

head

prev

next



e1

1

prev

next



head:next->e1





null

NULL



e1:s->null





e1:s->null





e2

2

prev

next



e2:s->e1:wn





e3

3

prev

next



e3:s->e2:w





e4

4

prev

next



e3:e->e4:w






e4:s->null





p1
list



p1->e3





p2
pending



p2->e2





接著

count=2=102 , 經過 for 迴圈
bits=10
,要進行 merge。此時 *tail 指向 2 這個節點,要將 21 進行 merge。







G


cluster_0

sublist 1


cluster_1

sublist 2



head

head

prev

next



e1

1

prev

next



head:next->e1





null

NULL



e1:s->null





e1:s->null





e2

2

prev

next



e2:s->e1:wn





e3

3

prev

next



e3:s->e2:w





e4

4

prev

next



e3:e->e4:w






e4:s->null





p1
list



p1->e3





p2
pending



p2->e2





p3
*tail



p3->e2





完成 merge 後
pending list : null <- (1->2) <- 3 <- pending







G


cluster_0

sublist 1


cluster_1

sublist 2



head

head

prev

next



e1

1

prev

next



head:next->e1





e2

2

prev

next



e1:e->e2:w





null

NULL



e1:s->null





e2:e->null





e3

3

prev

next



e3:s->e1:w





e3:e->null





e4

4

prev

next



e4:s->e3:w





e4:s->null





p1
list



p1->e4





p2
pending



p2->e3





count=3=112 ,跑完 for 迴圈後 bits = 0 不會進行 merge。







G


cluster_0

sublist 1


cluster_1

sublist 2


cluster_2

sublist 3



head

head

prev

next



e1

1

prev

next



head:next->e1





e2

2

prev

next



e1:e->e2:w





null

NULL



e1:s->null





e2:e->null





e3

3

prev

next



e3:s->e1:w





e3:e->null





e4

4

prev

next



e4:s->e3:w





e4:s->null





p1
list



p1->null





p2
pending



p2->e4





前面 do_while 迴圈結束會剩下一些 sublist ,如上圖。此時會用下方的程式,將其合併,並恢復成原本雙向環狀 linked list 。

    list = pending;
    pending = pending->prev;
    for (;;) {
    	struct list_head *next = pending->prev;
        
	    if (!next)
	    	break;
	    list = merge(priv, cmp, pending, list);
	    pending = next;
    }
    /* The final merge, rebuilding prev links */
    merge_final(priv, cmp, head, pending, list);

由以上的例子可以發現:

  • 每執行一次 do_while 迴圈,都會產生一個 長度為
    20
    的 sublist ,因此每跑一次迴圈 count 要加 1 , *tail 一開始也會指向這個 sublist 。
  • count
    的第 k bits 的值表示 pending list 各種 大小 sublist 的個數或狀態。若某位置 k bit 為 0 ,表示有
    2k
    的 sublist 要合併。如
    count=102
    ,表示有兩個長度為
    20
    的 sublist,現在要進行 merge。

如何確認 Linux 核心的 list_sort.c 有實際效益?你若只是將程式碼註解翻譯為中文,難道不會落得「舉燭」的下場嗎?

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →
jserv

引入 list_sort.c 到 lab-0

參考 SmallHanley 的方式引入。 另外額外添加命令 linux_sortqtest

效能比較

先將 qtest.c 中的亂數種子改為 srand((unsigned int) (1)) ,確保測試時隨機產生的 value 相同。

採用的 sort.cmd 如下,會更改第 4 行的數量。

option fail 0 option malloc 0 new ih RAND 1000000 linux_sort free quit
number of nodes my sort(s) linux sort(s) ratio (linux_sort/my sort)
100000 0.0617 0.0537 0.87
500000 0.4214 0.3552 0.84
1000000 0.9632 0.7982 0.82
1500000 1.5651 1.3027 0.83
2000000 2.1369 1.7708 0.83

由上方表格可發現, linux 的 merge sort 執行時間是自己實作約 0.83 倍。因此接著觀察程式碼,發現自己的 merge sort 似乎花費較多時間在走訪節點。於是,我在程式碼走訪節點的部分放入 walk++ 來計算總共走訪了多少節點。其中快慢節點的部分 fast = fast->next->next 雖然走了兩次節點,且還有 slow = slow->next ,但先採計一次。

    while (fast != real_head && fast->next != real_head) {
        fast = fast->next->next;
        slow = slow->next;
        walk += 1;
    }

得到的結果如下

number of nodes my sort linux sort ratio (linux_sort/my sort)
100000 2451728 1742495 0.71
500000 14029659 9843037 0.70
1000000 29559837 20687070 0.70
1500000 45522709 32006121 0.70
2000000 62119026 43373290 0.70

在節點數量夠大的情形下, linux 的 merge sort 走訪節點的次數大約是自己實作 0.7 倍。這樣的差距可以部份解釋 linux 的 merge sort 執行時間減少 20%。

接下來我嘗試尋找是自己的程式是哪裡多出了這些執行次數。比較一下程式碼,可發現最大的差別是拆分 linked list 的方式。

struct list_head *merge_sort_list(struct list_head *head, struct list_head *real_head) { if (!head || !head->next || head->next == real_head || head == real_head) { return head; } struct list_head *fast = head; struct list_head *slow = head; while (fast != real_head && fast->next != real_head) { fast = fast->next->next; slow = slow->next; } fast = slow; slow = slow->prev; slow->next = real_head; fast->prev = real_head; struct list_head *l1 = merge_sort_list(head, real_head); struct list_head *l2 = merge_sort_list(fast, real_head); return merge(l1, l2, real_head); }

上方我的實作核心理念是將目前這個 linked list 拆解成兩半,再呼叫 merge 合併。 注意到第 8 到第 10 行是採用快慢節點的方式尋找正中央的節點。因此,直接計算這部分走訪的次數,得到以下結果:

number of nodes times of visiting the nodes
100000 815024
500000 4692496
1000000 9884992
1500000 15111216
2000000 20769984

接著,將第一份表格的走訪節點總數扣除這裡的數字,得到下方的結果:

number of nodes 我的實作節點走訪次數 - 快慢節點走訪次數 我的實作節點走訪次數 - 快慢節點走訪次數 + number of nodes linux sort
100000 1636707 1736707 1742495
500000 9334163 9834163 9843037
1000000 19674845 20674845 20687070
1500000 30411493 31911493 32006121
2000000 41349042 43349042 43373290

從上方的表格可以看出,扣除掉快慢節點的次數後,節點走訪次數和 linux sort 比較接近。考慮扣除快慢節點次數後便無分割 linked list ,因此加回節點數量,即走訪一次所有節點之次數,發現所走訪節點總數和 linux sort 非常接近,差約 1% ,由此可以推測,我的實作所多出來的節點走訪次數絕大多都是快慢節點造成。接著,再回去分析一次 linux 的 list_sort 程式碼,其核心部分如下:

do {
		size_t bits;
		struct list_head **tail = &pending;

		/* Find the least-significant clear bit in count */
		for (bits = count; bits & 1; bits >>= 1)
			tail = &(*tail)->prev;
		/* Do the indicated merge */
		if (likely(bits)) {
			struct list_head *a = *tail, *b = a->prev;

			a = merge(priv, cmp, b, a);
			/* Install the merged result in place of the inputs */
			a->prev = b->prev;
			*tail = a;
		}

		/* Move one element from input list to pending */
		list->prev = pending;
		pending = list;
		list = list->next;
		pending->next = NULL;
		count++;
	} while (list);

整個 do_while 迴圈主體指走訪了一次一整個 linked list ,若電腦為 32 bit ,其中的 for 迴圈單次最多走訪 32 個 sublist ,整體次數取決於節點總數。相較之下我的實作需要用快慢指標走訪半個 linked list ,每個拆分的 sublist 又要再走訪一次,顯得十分耗時。

Cache and Branch Prediction

接著參考 bakudr18 使用 cachegrind 觀察 cache 的使用與 branch prediction 。 cachegrind 會根據所使用的 CPU 模擬 cache 的使用情形分支預測(可選)的情形。這裡將分支的模擬開啟,指令如下:

valgrind --tool=cachegrind --branch-sim=yes ./qtest -v 3 -f sort.cmd

使用 cg_annotate 可以看 cachegrind 的輸出,若啟用 --auto=yes 則會在使用的程式碼旁附上這段程式之 cache 和 branch 的使用和預測情形。

在此我選擇 1000000 個節點,其中每個 element_t 大小為 24 byte ,不考慮 char *value 的字串大小的話,總共占用約 22 MiB ,比 CPU 之 L3 cache 大。

接著先看 linux sort:

$ cg_annotate --auto=yes cachegrind.out.3534 
--------------------------------------------------------------------------------
I1 cache:         32768 B, 64 B, 8-way associative
D1 cache:         32768 B, 64 B, 8-way associative
LL cache:         16777216 B, 64 B, 16-way associative
Command:          ./qtest -v 3 -f sort.cmd
Data file:        cachegrind.out.3534
Events recorded:  Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Bc Bcm Bi Bim
Events shown:     Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Bc Bcm Bi Bim
Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Bc Bcm Bi Bim
Thresholds:       0.1 100 100 100 100 100 100 100 100 100 100 100 100
Include dirs:     
User annotated:   
Auto-annotation:  on

--------------------------------------------------------------------------------
Ir            I1mr  ILmr  Dr          D1mr       DLmr       Dw          D1mw      DLmw      Bc          Bcm        Bi         Bim 
--------------------------------------------------------------------------------
2,763,992,174 1,680 1,647 671,986,191 44,236,822 22,041,031 312,109,186 5,460,862 5,011,973 370,880,162 24,611,501 58,376,931 344  PROGRAM TOTALS

--------------------------------------------------------------------------------
Ir          I1mr ILmr Dr         D1mr       DLmr      Dw         D1mw      DLmw      Bc         Bcm        Bi         Bim  file:function
--------------------------------------------------------------------------------
413,515,263    8    7 93,436,562 15,138,427 5,672,168          0         0         0 43,580,643  7,041,986          0   0  /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/strcmp-avx2.S:__strcmp_avx2
350,594,924    3    3 88,003,584          3         0 33,001,344         0         0 54,647,387    709,722          0   0  /build/glibc-eX1tMB/glibc-2.31/stdlib/random_r.c:random_r
282,010,089   45   45 48,002,113          2         1 48,001,881 1,999,865 1,999,856 38,001,923        355          0   0  /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:_int_malloc
248,135,944    2    2 27,687,038  3,247,664   905,031 43,374,100         1         1 36,374,114 10,025,392 17,687,058   1  /home/hi/linux2022/lab0-c/list_sort.c:merge
231,009,408    2    2 88,003,584          0         0 22,000,896         0         0 33,001,344          1          0   0  /build/glibc-eX1tMB/glibc-2.31/stdlib/random.c:random
162,004,640   16   16 46,001,355     74,924    66,533 20,000,747         2         2 34,000,786         41          0   0  /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:_int_free
149,496,456    1    1 56,061,171 13,525,346 4,303,323 18,687,057         0         0          0          0          0   0  /home/hi/linux2022/lab0-c/queue.c:list_node_cmp
129,020,643    3    3 13,002,502          1         1 22,002,949         0         0  9,000,447  2,009,294          0   0  /home/hi/linux2022/lab0-c/qtest.c:fill_rand_string
 96,003,630    7    7 24,000,976          6         5  8,000,367         0         0 20,000,729         84          1   1  /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:malloc
 80,000,039    3    3 20,000,010          2         2 24,000,011   249,947   249,947  8,000,004          1          0   0  /home/hi/linux2022/lab0-c/harness.c:test_malloc
 79,377,718   56   53 39,688,820         80        50         19         1         1         15          9 39,688,786 123  ???:???
 76,000,068    4    4 24,000,012  1,099,483   978,274 14,000,006 2,198,102 1,773,157 14,000,007         39          0   0  /home/hi/linux2022/lab0-c/harness.c:test_free
 55,595,204    5    5  4,000,003          0         0  8,000,012         2         2 10,797,585  2,410,659          0   0  /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:__memset_avx2_unaligned_erms
 54,000,731    6    6  8,000,110  5,998,669 5,181,531  2,000,100         3         3 16,000,025         33          0   0  /home/hi/linux2022/lab0-c/qtest.c:show_queue
 49,002,502    3    3 14,002,502          0         0 11,000,000         0         0 18,005,004  1,000,027          0   0  /home/hi/linux2022/lab0-c/queue.c:q_insert_head
 45,002,235    1    1  9,000,447          0         0  9,000,447         0         0          0          0          0   0  /build/glibc-eX1tMB/glibc-2.31/stdlib/rand.c:rand
 42,024,309    7    7  6,999,966    522,385   500,192  8,999,988 1,011,210   987,933  6,000,011  1,406,745    999,999   1  /home/hi/linux2022/lab0-c/list_sort.c:list_sort
 42,001,428    2    2 10,000,340    125,024   109,588          0         0         0 10,000,340         13          0   0  /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:free
 41,999,958    4    4 11,999,988  1,000,849 1,000,007          0         0         0  3,999,996          3          0   0  /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/strcmp-sse42.S:__strcasecmp_l_avx
 34,000,076   10   10  7,000,026          2         2  4,000,014         0         0  7,000,020         23          0   0  /home/hi/linux2022/lab0-c/qtest.c:do_ih
 22,000,896    0    0 11,000,448          1         0          0         0         0          0          0          0   0  /build/glibc-eX1tMB/glibc-2.31/stdlib/../sysdeps/unix/sysv/linux/x86/lowlevellock.h:random
 13,799,986    3    3  3,000,484          0         0  2,000,303         5         5  3,399,589         84          0   0  /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:__memcpy_avx_unaligned_erms
 12,000,037    6    6  3,000,007  1,250,084 1,248,979  1,000,012         0         0  3,000,004          3          0   0  /home/hi/linux2022/lab0-c/qtest.c:do_linux_sort
 12,000,032    2    2  6,000,016          4         4  3,000,008         0         0          0          0          0   0  /home/hi/linux2022/lab0-c/harness.c:error_check
  9,000,000    1    1  3,000,000    249,863   203,325  3,000,000         0         0          0          0          0   0  /home/hi/linux2022/lab0-c/queue.c:q_release_element
  8,000,004    1    1  2,000,001          0         0  2,000,001         0         0          0          0          0   0  /usr/include/x86_64-linux-gnu/bits/string_fortified.h:test_free
  8,000,004    0    0          0          0         0  2,000,001         0         0          0          0          0   0  /usr/include/x86_64-linux-gnu/bits/string_fortified.h:test_malloc
  7,000,018    1    1  2,000,005    999,319   874,792  2,000,003         0         0  1,000,003          4          0   0  /home/hi/linux2022/lab0-c/queue.c:q_free
  4,000,009    1    1  1,000,002  1,000,001   994,580          0         0         0  1,000,002         10          0   0  /home/hi/linux2022/lab0-c/queue.c:q_size
  3,999,996    2    2  1,999,998          2         2          0         0         0          0          0          0   0  /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/strcmp-sse42.S:__strcasecmp_avx
  3,000,000    0    0          0          0         0  1,000,000         0         0          0          0          0   0  /usr/include/x86_64-linux-gnu/bits/string_fortified.h:q_insert_head

其中和 linux sort 相關者如下:

--------------------------------------------------------------------------------
Ir          I1mr ILmr Dr         D1mr       DLmr      Dw         D1mw      DLmw      Bc         Bcm        Bi         Bim  file:function
--------------------------------------------------------------------------------
248,135,944    2    2 27,687,038  3,247,664   905,031 43,374,100         1         1 36,374,114 10,025,392 17,687,058   1  /home/hi/linux2022/lab0-c/list_sort.c:merge
149,496,456    1    1 56,061,171 13,525,346 4,303,323 18,687,057         0         0          0          0          0   0  /home/hi/linux2022/lab0-c/queue.c:list_node_cmp
 42,024,309    7    7  6,999,966    522,385   500,192  8,999,988 1,011,210   987,933  6,000,011  1,406,745    999,999   1  /home/hi/linux2022/lab0-c/list_sort.c:list_sort

我的實作:

--------------------------------------------------------------------------------
I1 cache:         32768 B, 64 B, 8-way associative
D1 cache:         32768 B, 64 B, 8-way associative
LL cache:         16777216 B, 64 B, 16-way associative
Command:          ./qtest - v -f sort.cmd
Data file:        cachegrind.out.3469
Events recorded:  Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Bc Bcm Bi Bim
Events shown:     Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Bc Bcm Bi Bim
Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Bc Bcm Bi Bim
Thresholds:       0.1 100 100 100 100 100 100 100 100 100 100 100 100
Include dirs:     
User annotated:   
Auto-annotation:  on

--------------------------------------------------------------------------------
Ir            I1mr  ILmr  Dr          D1mr       DLmr       Dw          D1mw      DLmw      Bc          Bcm        Bi         Bim 
--------------------------------------------------------------------------------
2,728,893,058 1,674 1,643 701,700,487 59,014,266 24,973,956 314,072,535 4,471,042 4,024,119 412,959,089 24,312,062 39,677,670 340  PROGRAM TOTALS

--------------------------------------------------------------------------------
Ir          I1mr ILmr Dr         D1mr       DLmr      Dw         D1mw      DLmw      Bc         Bcm        Bi         Bim  file:function
--------------------------------------------------------------------------------
413,257,307    8    7 93,375,558 13,547,553 4,606,231          0         0         0 43,554,992  7,002,033          0   0  /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/strcmp-avx2.S:__strcmp_avx2
350,594,924    3    3 88,003,584          3         0 33,001,344         0         0 54,647,387    709,722          0   0  /build/glibc-eX1tMB/glibc-2.31/stdlib/random_r.c:random_r
282,010,089   45   45 48,002,113          2         1 48,001,881 1,999,865 1,999,856 38,001,923        356          0   0  /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:_int_malloc
279,101,035    4    4 79,699,375 15,183,636 4,208,442 62,024,529     4,096         8 59,527,376 10,500,506          0   0  /home/hi/linux2022/lab0-c/queue.c:merge
231,009,408    2    2 88,003,584          0         0 22,000,896         0         0 33,001,344          1          0   0  /build/glibc-eX1tMB/glibc-2.31/stdlib/random.c:random
162,004,640   15   15 46,001,355     74,924    66,533 20,000,747         2         2 34,000,786         40          0   0  /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:_int_free
129,020,643    3    3 13,002,502          1         1 22,002,949         0         0  9,000,447  2,009,294          0   0  /home/hi/linux2022/lab0-c/qtest.c:fill_rand_string
121,739,231    2    2 39,836,411 17,478,366 4,499,174 10,999,992    17,294        70 23,951,422    672,165          0   0  /home/hi/linux2022/lab0-c/queue.c:merge_sort_list
 96,003,622    7    7 24,000,976          6         5  8,000,367         0         0 20,000,729         81          1   1  /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:malloc
 80,000,039    3    3 20,000,010          2         2 24,000,011   249,947   249,947  8,000,004          1          0   0  /home/hi/linux2022/lab0-c/harness.c:test_malloc
 79,353,310   55   53 39,676,616      1,786        57         19         1         1         15          9 39,676,582 121  ???:???
 76,000,068    4    4 24,000,012  1,099,483   978,274 14,000,006 2,198,101 1,773,157 14,000,007         40          0   0  /home/hi/linux2022/lab0-c/harness.c:test_free
 55,595,204    5    5  4,000,003          0         0  8,000,012         2         2 10,797,585  2,410,659          0   0  /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:__memset_avx2_unaligned_erms
 54,000,731    6    6  8,000,110  5,998,669 5,181,531  2,000,100         3         3 16,000,025         33          0   0  /home/hi/linux2022/lab0-c/qtest.c:show_queue
 49,002,502    3    3 14,002,502          0         0 11,000,000         0         0 18,005,004  1,000,027          0   0  /home/hi/linux2022/lab0-c/queue.c:q_insert_head
 45,002,235    1    1  9,000,447          0         0  9,000,447         0         0          0          0          0   0  /build/glibc-eX1tMB/glibc-2.31/stdlib/rand.c:rand
 42,001,428    2    2 10,000,340    125,024   109,588          0         0         0 10,000,340         13          0   0  /build/glibc-eX1tMB/glibc-2.31/malloc/malloc.c:free
 41,999,958    4    4 11,999,988  1,000,848 1,000,007          0         0         0  3,999,996          3          0   0  /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/strcmp-sse42.S:__strcasecmp_l_avx
 34,000,076   10   10  7,000,026          2         2  4,000,014         0         0  7,000,020         23          0   0  /home/hi/linux2022/lab0-c/qtest.c:do_ih
 22,000,896    0    0 11,000,448          1         0          0         0         0          0          0          0   0  /build/glibc-eX1tMB/glibc-2.31/stdlib/../sysdeps/unix/sysv/linux/x86/lowlevellock.h:random
 13,799,988    3    3  3,000,484          0         0  2,000,303         5         5  3,399,590         83          0   0  /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:__memcpy_avx_unaligned_erms
 12,000,037    6    6  3,000,007  1,250,084 1,249,154  1,000,012         0         0  3,000,004          4          0   0  /home/hi/linux2022/lab0-c/qtest.c:do_sort
 12,000,032    2    2  6,000,016          4         4  3,000,008         0         0          0          0          0   0  /home/hi/linux2022/lab0-c/harness.c:error_check
  9,000,000    1    1  3,000,000    249,863   203,325  3,000,000         0         0          0          0          0   0  /home/hi/linux2022/lab0-c/queue.c:q_release_element
  8,000,004    1    1  2,000,001          0         0  2,000,001         0         0          0          0          0   0  /usr/include/x86_64-linux-gnu/bits/string_fortified.h:test_free
  8,000,004    0    0          0          0         0  2,000,001         0         0          0          0          0   0  /usr/include/x86_64-linux-gnu/bits/string_fortified.h:test_malloc
  7,000,018    1    1  2,000,005    999,319   874,792  2,000,003         0         0  1,000,003          4          0   0  /home/hi/linux2022/lab0-c/queue.c:q_free
  4,000,019    1    1  1,000,004  1,000,002   999,602          4         2         2  1,000,003          7          0   0  /home/hi/linux2022/lab0-c/queue.c:q_sort
  4,000,009    1    1  1,000,002  1,000,001   994,580          0         0         0  1,000,002         10          0   0  /home/hi/linux2022/lab0-c/queue.c:q_size
  3,999,996    2    2  1,999,998          2         2          0         0         0          0          0          0   0  /build/glibc-eX1tMB/glibc-2.31/string/../sysdeps/x86_64/multiarch/strcmp-sse42.S:__strcasecmp_avx
  3,000,000    0    0          0          0         0  1,000,000         0         0          0          0          0   0  /usr/include/x86_64-linux-gnu/bits/string_fortified.h:q_insert_head

和我的實作有關的部分:

--------------------------------------------------------------------------------
Ir          I1mr ILmr Dr         D1mr       DLmr      Dw         D1mw      DLmw      Bc         Bcm        Bi         Bim  file:function
--------------------------------------------------------------------------------
279,101,035    4    4 79,699,375 15,183,636 4,208,442 62,024,529     4,096         8 59,527,376 10,500,506          0   0  /home/hi/linux2022/lab0-c/queue.c:merge
121,739,231    2    2 39,836,411 17,478,366 4,499,174 10,999,992    17,294        70 23,951,422    672,165          0   0  /home/hi/linux2022/lab0-c/queue.c:merge_sort_list
  4,000,019    1    1  1,000,004  1,000,002   999,602          4         2         2  1,000,003          7          0   0  /home/hi/linux2022/lab0-c/queue.c:q_sort

整體而言,我的實作的分支預測錯誤較少,但是 cache miss 數量較高。經過加總,我的 cache miss 總共約有 92,486,700 ,而 linux 的 merge sort 有 76,754,015 ,依據 Cachegrind 的說法, L1 cache miss 要付出約 10 cycle 代價, LL cache miss 要付出約 200 cycle 計算,我的 code 在 cache miss 方面需要多付出約 526889580 cycle , 假設 CPU 是在約 4GHz 運行,則會有約 0.13 秒的差距。

--------------------------------------------------------------------------------
Ir            I1mr  ILmr  Dr          D1mr       DLmr       Dw          D1mw      DLmw      Bc          Bcm        Bi         Bim 
--------------------------------------------------------------------------------

          .    .    .         .         .       .          .         .       .          .         .          .   .  __attribute__((nonnull(2, 3))) void list_sort(void *priv,
          .    .    .         .         .       .          .         .       .          .         .          .   .                                                struct list_head *head,
          .    .    .         .         .       .          .         .       .          .         .          .   .                                                list_cmp_func_t cmp)
         13    1    1         1         0       0          8         0       0          0         0          0   0  {
          2    1    1         1         0       0          1         0       0          0         0          0   0      struct list_head *list = head->next, *pending = NULL;
          2    0    0         0         0       0          0         0       0          0         0          0   0      size_t count = 0; /* Count of pending */
          .    .    .         .         .       .          .         .       .          .         .          .   .  
          5    0    0         1         0       0          0         0       0          1         0          0   0      if (list == head->prev) /* Zero or one elements */
          .    .    .         .         .       .          .         .       .          .         .          .   .          return;
          .    .    .         .         .       .          .         .       .          .         .          .   .  
          .    .    .         .         .       .          .         .       .          .         .          .   .      /* Convert to a null-terminated singly-linked list. */
          1    0    0         0         0       0          1         0       0          0         0          0   0      head->prev->next = NULL;
          .    .    .         .         .       .          .         .       .          .         .          .   .  
          .    .    .         .         .       .          .         .       .          .         .          .   .      do {
          .    .    .         .         .       .          .         .       .          .         .          .   .          size_t bits;
    999,999    0    0         0         0       0          0         0       0          0         0          0   0          struct list_head **tail = &pending;
          .    .    .         .         .       .          .         .       .          .         .          .   .  
          .    .    .         .         .       .          .         .       .          .         .          .   .          /* Find the least-significant clear bit in count */
  5,499,977    0    0         0         0       0          0         0       0  1,999,992   908,889          0   0          for (bits = count; bits & 1; bits >>= 1)
    999,993    0    0   999,993     1,946       1          0         0       0          0         0          0   0              tail = &(*tail)->prev;
          .    .    .         .         .       .          .         .       .          .         .          .   .          /* Do the indicated merge */
  1,999,998    0    0         0         0       0          0         0       0    999,999        25          0   0          if (likely(bits)) {
  1,999,960    0    0 1,999,960     5,829       5          0         0       0          0         0          0   0              struct list_head *a = *tail, *b = a->prev;
          .    .    .         .         .       .          .         .       .          .         .          .   .  
  3,999,920    0    0         0         0       0    999,980         1       1          0         0          0   0              a = merge(priv, cmp, b, a);
          .    .    .         .         .       .          .         .       .          .         .          .   .              /* Install the merged result in place of the inputs */
  1,999,960    0    0   999,980     6,635       8    999,980     3,398       4          0         0          0   0              a->prev = b->prev;
    999,980    0    0         0         0       0    999,980     7,798      11          0         0          0   0              *tail = a;
          .    .    .         .         .       .          .         .       .          .         .          .   .          }
          .    .    .         .         .       .          .         .       .          .         .          .   .  
          .    .    .         .         .       .          .         .       .          .         .          .   .          /* Move one element from input list to pending */
  2,000,000    1    1 1,000,000     7,798      11  1,000,000 1,000,000 987,913          0         0          0   0          list->prev = pending;
  1,000,000    0    0         0         0       0  1,000,000         0       0          0         0          0   0          pending = list;
  1,999,999    0    0 1,000,000   250,081 250,081          0         0       0          0         0          0   0          list = list->next;
  1,000,000    0    0         0         0       0  1,000,000         0       0          0         0          0   0          pending->next = NULL;
  1,999,999    1    1         0         0       0          0         0       0          0         0          0   0          count++;
  2,000,000    0    0         0         0       0          0         0       0  1,000,000         1          0   0      } while (list);
          .    .    .         .         .       .          .         .       .          .         .          .   .  
          .    .    .         .         .       .          .         .       .          .         .          .   .      /* End of input; merge together all the pending lists. */
          .    .    .         .         .       .          .         .       .          .         .          .   .      list = pending;
          1    0    0         0         0       0          1         0       0          0         0          0   0      pending = pending->prev;
          .    .    .         .         .       .          .         .       .          .         .          .   .      for (;;) {
         37    0    0        19        12       2          0         0       0          0         0          0   0          struct list_head *next = pending->prev;
          .    .    .         .         .       .          .         .       .          .         .          .   .  
         38    1    1         0         0       0          0         0       0         19        13          0   0          if (!next)
          .    .    .         .         .       .          .         .       .          .         .          .   .              break;
        108    0    0         0         0       0         18         0       0          0         0          0   0          list = merge(priv, cmp, pending, list);
         18    0    0         0         0       0         18        11       2          0         0          0   0          pending = next;
          .    .    .         .         .       .          .         .       .          .         .          .   .      }
          .    .    .         .         .       .          .         .       .          .         .          .   .      /* The final merge, rebuilding prev links */
          .    .    .         .         .       .          .         .       .          .         .          .   .      merge_final(priv, cmp, head, pending, list);
         11    0    0         9         2       2          0         0       0          1         0          0   0  }

上方程式碼為 list_sort 程式碼細部的 cache miss ,可觀察到大部分的 cache miss 發生在指標的操作上, 如 : list = list->next 。另外在 merge()merge_final 中也是一樣,像是下方 merge() 第 14 行與 22 行一樣,推測可能是指標不需要連續記憶體,當總資料量較大時,就需要不時將資料從記憶體搬入 cache 中,且每一次更新不一定能將大量節點搬入 cacheline ,從而造成 cachemiss。另一大部分的 cache miss 發生在節點值的比較,在 qeueu.c 中的 list_node_cmp 也發生了大量的 cache miss ,可能是其位址和節點的位址不連續而不在 cache 中。而大部分分支預測錯誤發生在比較兩個節點值的大小,節點位址是否為 NULL ,以及 do_while 中的 for 迴圈。猜測是因為節點值的大小本來就是隨機,不好預測,而 for 迴圈因為機制較特殊而不易預測其是否達成結束條件。

-------------------------------------------------------------------------------- Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw Bc Bcm Bi Bim . . . . . . . . . . . . . __attribute__((nonnull(2, 3, 4))) static struct list_head * . . . . . . . . . . . . . merge(void *priv, list_cmp_func_t cmp, struct list_head *a, struct list_head *b) 12,999,974 1 1 999,998 0 0 5,999,988 0 0 0 0 0 0 { 2,999,994 0 0 0 0 0 999,998 0 0 0 0 0 0 struct list_head *head = NULL, **tail = &head; . . . . . . . . . . . . . . . . . . . . . . . . . . for (;;) { . . . . . . . . . . . . . /* if equal, take 'a' -- important for sort stability */ 106,122,348 0 0 0 0 0 17,687,058 1 1 17,687,058 8,849,057 17,687,058 1 if (cmp(priv, a, b) <= 0) { 8,891,302 0 0 0 0 0 8,891,302 0 0 0 0 0 0 *tail = a; 8,891,302 0 0 0 0 0 0 0 0 0 0 0 0 tail = &a->next; 25,672,336 0 0 8,891,302 1,744,985 557,879 0 0 0 0 0 0 0 a = a->next; 17,782,604 0 0 0 0 0 0 0 0 8,891,302 573,905 0 0 if (!a) { 500,785 0 0 0 0 0 500,785 0 0 0 0 0 0 *tail = b; 500,785 0 0 0 0 0 0 0 0 0 0 0 0 break; . . . . . . . . . . . . . } . . . . . . . . . . . . . } else { 8,795,756 0 0 0 0 0 8,795,756 0 0 0 0 0 0 *tail = b; 8,795,756 1 1 0 0 0 0 0 0 0 0 0 0 tail = &b->next; 17,092,299 0 0 8,795,756 1,487,068 347,126 0 0 0 0 0 0 0 b = b->next; 17,591,512 0 0 0 0 0 0 0 0 8,795,756 602,430 0 0 if (!b) { 499,213 0 0 0 0 0 499,213 0 0 0 0 0 0 *tail = a; . . . . . . . . . . . . . break; . . . . . . . . . . . . . } . . . . . . . . . . . . . } . . . . . . . . . . . . . } 999,998 0 0 999,998 7,807 13 0 0 0 0 0 0 0 return head; 9,999,980 0 0 7,999,984 7,804 13 0 0 0 999,998 0 0 0 }

接下來是我的實作部分,這邊受限於長度只放 merge_sort_list()

Ir         I1mr ILmr Dr         D1mr       DLmr      Dw         D1mw   DLmw Bc         Bcm       Bi Bim 


         .    .    .          .          .         .          .      .    .          .         .  .   .  struct list_head *merge_sort_list(struct list_head *head,
         .    .    .          .          .         .          .      .    .          .         .  .   .                                    struct list_head *real_head)
11,999,994    1    1          0          0         0  5,999,997      0    0          0         0  0   0  {
19,999,989    0    0  1,999,999      6,231        15          0      0    0  4,999,997        11  0   0      if (!head || !head->next || head->next == real_head || head == real_head) {
         .    .    .          .          .         .          .      .    .          .         .  .   .          return head;
         .    .    .          .          .         .          .      .    .          .         .  .   .      }
         .    .    .          .          .         .          .      .    .          .         .  .   .  
         .    .    .          .          .         .          .      .    .          .         .  .   .      struct list_head *fast = head;
   999,999    0    0          0          0         0          0      0    0          0         0  0   0      struct list_head *slow = head;
46,969,283    0    0  9,066,433  6,233,411 1,872,535          0      0    0 18,951,425   672,154  0   0      while (fast != real_head && fast->next != real_head) {
 9,884,992    0    0  9,884,992  6,237,438 1,869,199          0      0    0          0         0  0   0          fast = fast->next->next;
 9,884,992    1    1  9,884,992  4,744,205   519,424          0      0    0          0         0  0   0          slow = slow->next;
         .    .    .          .          .         .          .      .    .          .         .  .   .      }
         .    .    .          .          .         .          .      .    .          .         .  .   .  
         .    .    .          .          .         .          .      .    .          .         .  .   .      fast = slow;
   999,999    0    0    999,999    251,620   237,996          0      0    0          0         0  0   0      slow = slow->prev;
   999,999    0    0          0          0         0    999,999      0    0          0         0  0   0      slow->next = real_head;
   999,999    0    0          0          0         0    999,999      0    0          0         0  0   0      fast->prev = real_head;
         .    .    .          .          .         .          .      .    .          .         .  .   .  
 3,999,996    0    0          0          0         0    999,999 17,294   70          0         0  0   0      struct list_head *l1 = merge_sort_list(head, real_head);
 3,999,996    0    0          0          0         0    999,999      0    0          0         0  0   0      struct list_head *l2 = merge_sort_list(fast, real_head);
 2,999,997    0    0          0          0         0    999,999      0    0          0         0  0   0      return merge(l1, l2, real_head);
 7,999,996    0    0  7,999,996      5,461         5          0      0    0          0         0  0   0  }

可以觀察到和 linux sort 一樣,在操作指標部分和比較節點值的部分有大量的 cache miss 。然而,在快慢節點的部分也出現了大量 cache miss 。 Branch miss prediction 發生在節點值的比較以及走訪上,即 merge() 中,快慢節點也有發生,但量不大。

小結

以上從結點走訪次數、 cache miss 次數以及分支預測討論 linux 的 list_sort 和我實作的差別,可發現無論是走訪次數還是 cache miss 次數 前者都較少,只有在分支預測的錯誤數量小輸一些。以 CPU 的觀點而言,分支預測錯誤頂多是多消耗數十 cycle ,但是 cache miss 所衍生的成本非常的大,可達數百 cycle ,因此對於 Linux 核心而言,減少 cache miss 是當務之急,從上方的測試也看出其對 cache 較友善。

Todo

在 qtest 提供新的命令 shuffle

嘗試在 qtest.cconsole_init() 加入 shuffle,使執行 help 命令 指令 時顯示 shuffle 。

command 是「命令」,instruction 是「指令」。二者語意根本不同!

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →
jserv

static void console_init()
{
    ADD_COMMAND(new, "                | Create new queue");
    ....
    ....
    ....
    ADD_COMMAND(shuffle,
                "                | Shuffle");
    add_param("length", &string_length, "Maximum length of displayed string",
              NULL);
    add_param("malloc", &fail_probability, "Malloc failure probability percent",
              NULL);
    add_param("fail", &fail_limit,
              "Number of times allow queue operations to return false", NULL);
}

Make file 時產生以下錯誤

In file included from qtest.c:36:
qtest.c: In function ‘console_init’:
console.h:46:45: error: ‘do_shuffle’ undeclared (first use in this function)
   46 | #define ADD_COMMAND(cmd, msg) add_cmd(#cmd, do_##cmd, msg)
      |                                             ^~~
qtest.c:798:5: note: in expansion of macro ‘ADD_COMMAND’
  798 |     ADD_COMMAND(shuffle,
      |     ^~~~~~~~~~~
console.h:46:45: note: each undeclared identifier is reported only once for each function it appears in
   46 | #define ADD_COMMAND(cmd, msg) add_cmd(#cmd, do_##cmd, msg)
      |                                             ^~~
qtest.c:798:5: note: in expansion of macro ‘ADD_COMMAND’
  798 |     ADD_COMMAND(shuffle,
      |     ^~~~~~~~~~~
make: *** [Makefile:50: qtest.o] Error 1

從以上錯誤訊息可知 ADD_COMMAND 是一個巨集,會出現錯誤是因為沒有實作 do_##cmd 的 function。因此先實作一個簡單的 do_shuffle() ,在 cmd 執行時印出字串。

static bool do_shuffle(int argc, char *argv[]){
    if (argc != 1) {
        report(1, "%s takes no arguments", argv[0]);
        return false;
    } else {
        printf("Here is shuffle.\n");
    }
    return true;
}

執行 help 命令,在自 19 行出現 shuffle。

cmd> help Commands: # ... | Display comment dedup | Delete all nodes that have duplicate string dm | Delete middle node in queue free | Delete queue help | Show documentation ih str [n] | Insert string str at head of queue n times. Generate random string(s) if str equals RAND. (default: n == 1) it str [n] | Insert string str at tail of queue n times. Generate random string(s) if str equals RAND. (default: n == 1) log file | Copy output to file new | Create new queue option [name val] | Display or set options quit | Exit program reverse | Reverse queue rh [str] | Remove from head of queue. Optionally compare to expected value str rhq | Remove from head of queue without reporting value. rt [str] | Remove from tail of queue. Optionally compare to expected value str show | Show queue contents shuffle | Shuffle size [n] | Compute queue size n times (default: n == 1) sort | Sort queue in ascending order source file | Read commands from source file swap | Swap every two adjacent nodes in queue time cmd arg ... | Time command execution Options: echo 1 Do/don't echo commands error 5 Number of errors until exit fail 30 Number of times allow queue operations to return false length 1024 Maximum length of displayed string malloc 0 Malloc failure probability percent simulation 0 Start/Stop simulation mode verbose 4 Verbosity level

執行 shuffle ,成功印出字串。

cmd> shuffle
Here is shuffle.

接著進行 shuffle 功能實作,採用 Fisher–Yates shuffle 演算法。

struct list_head *find_node(struct list_head *head, int index, int size_q)
{
    struct list_head *target = NULL;
    target = head;
    if (index <= ((size_q >> 1) + 1)) {
        for (int i = 0; i <= index; ++i) {
            target = target->next;
        }
    } else {
        for (int i = size_q - 1; i >= index; --i) {
            target = target->prev;
        }
    }
    return target;
}

上方的程式碼是用來尋找節點的位址。為了充分運用 doubly circular linked list 的特性,若尋找的節點在後半部,會從尾端開始找,反之會從 head 尋找。

void q_shuffle(struct list_head *head)
{
    int s = q_size(head);
    if (!head || !head->next || head->next == head ||
        head->next->next == head) {
        return;
    }
    srand(time(NULL));
    int j = 0;
    struct list_head *node1 = head->prev, *tmp = NULL;
    struct list_head *node2 = NULL;
    for (int i = s - 1; i >= 1; --i) {
        j = rand() % (i + 1);
        tmp = node1->prev;
        node2 = find_node(head, j, s);
        node1 = find_node(head, i, s);
        if (i == j) {
            node1 = tmp;
            continue;
        }
        list_del(node1);
        list_add(node1, node2->prev);
        if (tmp != node2) {
            list_del(node2);
            list_add(node2, tmp);
            node1 = tmp;
        }
    }
}

採用 for 迴圈用 node1 從 linked list 尾端開始走訪每個節點,使用 rand() 隨機選出此節點前隨機一個節點 node2 並交換,持續走訪到第二個節點便結束整個演算法。

一開始狀態如下圖,隨機選出第 3 個節點交換。







G



e1

head

prev

next



e2

1

prev

next



e1:e->e2:w






e3

2

prev

next



e2:e->e3:w






e4

3

prev

next



e3:e->e4:w






e5

4

prev

next



e4:e->e5:w






e5:s->e1:s






p1
node1



p1->e5





p2
node2



p2->e4





p3
tmp



p3->e4





先使用 list_del 移除 node1tmp 指向 ,在使用 list_addnode2->prev 為開頭將 node1 插入 node2 前方。







G



e1

head

prev

next



e2

1

prev

next



e1:e->e2:w






e3

2

prev

next



e2:e->e3:w






e5

4

prev

next



e3:e->e5:w






e4

3

prev

next



e4:s->e1:s






e5:e->e4:w






p1
node1



p1->e5





p2
node2



p2->e4





p3
tmp



p3->e4





此時判斷 tmpnode2 位址是否相等。若相等,如上圖一樣,則完成這一次交換。若否,則使用 list_del 移除 node2 並用 list_addtmp 為開頭插入。

然而以上步驟需判斷 node2tmp 是否相等,增加分支預測。經過修改後,採用不同的方式所減了這個判斷條件。程式碼如下。

void q_shuffle(struct list_head *head)
{
    int s = q_size(head);
    if (!head || !head->next || head->next == head ||
        head->next->next == head) {
        return;
    }
    srand(time(NULL));
    int j = 0;
    struct list_head *node1 = head->prev, *tmp = NULL;
    struct list_head *node2 = NULL;
    for (int i = s - 1; i >= 1; --i) {
        j = rand() % (i + 1);
        node2 = find_node(head, j, s);
        if (i == j) {
            node1 = node1->prev;
            continue;
        }
        
        tmp = node2->prev;
        list_del(node2);
        list_add(node2,node1);
        list_del(node1);
        list_add(node1, tmp);
        node1 = node2->prev;
    }
}

主要的差別在於 tmp 改成指向 node2->prev,且先移除 node2 ,插入 node1 後方,再移除 node1 並插入 tmp 指向節點之後方。若遇到 node1node2 相鄰時,不須像之前一樣判斷,前述的 node1 之操作會變成插入原本之位置。考慮到隨機取亂數時,node1node2 相鄰機率不高,這樣的修改也許有助於效能。

一開始狀態如下圖,隨機選出第 3 個節點交換。







G



e1

head

prev

next



e2

1

prev

next



e1:e->e2:w






e3

2

prev

next



e2:e->e3:w






e4

3

prev

next



e3:e->e4:w






e5

4

prev

next



e4:e->e5:w






e5:s->e1:s






p1
node1



p1->e5





p2
node2



p2->e4





p3
tmp



p3->e3





node2 插入 node1 後方







G



e1

head

prev

next



e2

1

prev

next



e1:e->e2:w






e3

2

prev

next



e2:e->e3:w






e5

4

prev

next



e3:e->e5:w






e4

3

prev

next



e4:s->e1:s






e5:e->e4:w






p1
node1



p1->e5





p2
node2



p2->e4





p3
tmp



p3->e3





移除 node1







G



e1

head

prev

next



e2

1

prev

next



e1:e->e2:w






e3

2

prev

next



e2:e->e3:w






e4

3

prev

next



e3:e->e4:w






e4:s->e1:s





e5

4

prev

next



e5:s->e1:ws





e5->e4





p1
node1



p1->e5





p2
node2



p2->e4





p3
tmp



p3->e3





插入 tmp 後方







G



e1

head

prev

next



e2

1

prev

next



e1:e->e2:w






e3

2

prev

next



e2:e->e3:w






e5

4

prev

next



e3:e->e5:w






e4

3

prev

next



e4:s->e1:s






e5:e->e4:w






p1
node1



p1->e5





p2
node2



p2->e4





p3
tmp



p3->e3





若選到相鄰節點,只是做了一次無用插入。

效能比較

下方是改善後的運行時間,7 次執行平均花費 0.845 秒

cmd> new
l = []
cmd> ih 1 20000
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 0.683
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 0.879
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 0.874
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 0.866
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 0.856
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 0.876
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 0.881
cmd> quit
Freeing queue

下方是原本程式的運行時間,平均約 0.857 秒。

cmd> new
l = []
cmd> ih 1 20000
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 0.674
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 0.912
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 0.891
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 0.871
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 0.894
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 0.895
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 0.867

經過比較後發現實際執行時間新版和舊版但差距非常小,幾乎可視為誤差。推測是因為選到鄰近倆的機率實在很小,分支預測若是以 taken 為主,其失敗的次數很低,造成損失很小。

假設整個 linked list 共有

n 個節點,且選取每個節點的機率是一樣的。跑第一次迴圈時,會選取整個 linked list 中隨機一點和最後面節點
n
交換,但剛好選到第
n1
節點之機率為
1n
。同理,下一次迴圈選到剛好旁邊的節點的機率為
1n1
。直到第 2 個節點,選取到相鄰節點的機率為
12
。因此,整個長度為
n
的 linked list 有相鄰節點的期望值為
k=2n1k×1
,當
n=20000
時為
9.48073
,大約會選到 9 個點是鄰近的點,有 2000 倍的差距。

許多 random permutation 的討論忽略個別單元的存取成本,但在 linked list 上無論是節點交換,還是走訪鄰近節點,都不能忽略其成本。
以 Android 為例,系統內建 Low Memory Killer Daemon,目的是在系統可用記憶體低於某個數值時,嘗試強制釋放系統資源,而在 Linux 核心中,所有的 task (包含 process 及 thread) 均以 circular doubly-linked list 來保存和操作,某些 Android 裝置這對 low memory killer 還會擴展為「隨機釋放某些任務」這樣的操作,從而允許在更低硬體規格的環境中運作 Android,這樣光是找到某些任務,對系統就是可觀的成本,尤其是系統資源已不充分時。
上述討論可以很實際,甚至擴充後可用以 model 複雜系統的行為。

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →
jserv

然而,比較程式碼可發現,新版程式雖然移除了條件判斷,卻多做一次節點的刪除與插入。每一次的插入和刪除的成本不低,省去了分支預測錯誤的懲罰卻增加操作節點的成本,因此這樣的改善不一定的增加效率。
以下測試較為極端的狀況,每一次交換的點必在旁邊。

舊版,平均 4.509 秒:

cmd> new
l = []
cmd> ih 1 60000
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 4.513
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 4.474
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 4.543
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 4.520
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 4.493
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 4.494
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 4.528
cmd> quit
Freeing queue

新版,平均約 4.5457 秒

cmd> new
l = []
cmd> ih 1 60000
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 4.533
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 4.507
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 4.495
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 4.565
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 4.492
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 4.699
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 4.559
cmd> quit
Freeing queue

以下是直接做鄰近節點交換,沒有分支,平均約 4.493 秒

cmd> new
l = []
cmd> ih 1 60000
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 4.476
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 4.489
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 4.505
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 4.487
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 4.506
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 4.495
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 4.498
cmd> quit
Freeing queue

從以上的測試可發現, 差一個 list_dellist_add 的操作差距相較於分支明顯不少,減少約 0.04 秒,而沒有分支的和有分支的情形約差 0.01 秒。因此,這次減少分支的改動對於效能的改善可說是不明顯。

另一個優化思路

對 linked list 的節點操作成本不小,注意到原本程式碼搜尋要交換節點位置是採用從頭或最尾端開始尋找。然而,Fisher–Yates shuffle 演算法的節點會從尾端開始逐一向前,顯然原本程式並沒有利用這一點。因此對於節點搜尋的方式稍作修改,如下。

void q_shuffle(struct list_head *head)
{
    int s = q_size(head);
    if (!head || !head->next || head->next == head ||
        head->next->next == head) {
        return;
    }
    int j = 0;
    struct list_head *node1 = head->prev, *tmp = NULL;
    struct list_head *node2 = NULL;
    struct list_head *tail = head->prev;
    for (int i = s - 1; i >= 1; --i) {
        j = rand() % (i + 1);

        if (i == j) {
            node1 = node1->prev;
            continue;
        }
        head->prev = node1;
        node2 = find_node(head, j, i + 1);

        tmp = node2->prev;
        list_del(node2);
        list_add(node2, node1);
        list_del(node1);
        list_add(node1, tmp);
        node1 = node2->prev;
        if (i == s - 1){
            tail = node2;
        }
        
    }
    head->prev = tail;
}

tail 指向整個 lined list 最尾端的節點,並在第一次迴圈時記下最尾端節點之位置。接著每次迴圈將 head->prev 指向 node1 ,使其成為整個 linked list 之暫時尾端。隨著迴圈執行,搜索範圍逐漸減少,可以減少搜尋間。

執行結果如下:

舊版:

cmd> new
l = []
cmd> ih 1 40000
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 2.835
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 3.578
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 3.653
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 3.714
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 3.690
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 3.690
cmd> time shuffle
cmd> quit
Freeing queue

新版:

cmd> new
l = []
cmd> ih 1 40000
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 1.552
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 2.238
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 2.474
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 2.273
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 2.295
cmd> time shuffle
l = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... ]
Delta time = 2.283
cmd> quit
Freeing queue

從上方結果很明顯看出節省了約 1 秒的時間,相較於之前對於分支減少的改動,減少走訪節點數量顯著減少執行時間。