2021q1 Homework2 (quiz2)

# 2021q1 Homework2 (quiz2) contributed by < [`linD026`](https://github.com/linD026) > ###### tags: `linux2021` > [2021 年第 2 週隨堂測驗題目](https://hackmd.io/@sysprog/linux2021-quiz2) --- ## 0 進度 - [x] 測驗 `1` - [ ] list_sort.c ( Linux ) 圖形驗證 - [x] 測驗 `2` - [ ] percpu 說明要完善 - [x] 測驗 `3` - [ ] 改寫可以更好 - [ ] 說明要完善 - [ ] 測驗 `4` - [ ] release - [ ] 實驗不夠完善 - thread 太少 - ref 問題 --- ## 測驗 `1` ## 1 解釋程式運作原理與可改進 ### 結構定義 * 此 linked list 是以 queue_t 作為函式之間的主要操作物件。 * 但在實際進行雙向鏈結串列時的插入與排序等操作，**幾乎**是以 list_head 為主。 * 還有一個結構 list_ele_t 為鏈結串列的節點作為儲存節點資訊，目的是為了與鏈結操作分開。 * list.h ```cpp /** * struct list_head - Head and node of a doubly-linked list * @prev: pointer to the previous node in the list * @next: pointer to the next node in the list */ struct list_head { struct list_head *prev, *next; }; ``` ```graphviz digraph main{ rankdir = "LR" subgraph cluster_list_head { style = invis; list_head_tit [label = "list head ( link )" shape = "none" width = 0.2 height = 0.1]; list_head [label = "{prev|next}" shape = "record"] } } ``` * merge_sort.c ```cpp typedef struct __element { char *value; struct __element *next; struct list_head list; } list_ele_t; typedef struct { list_ele_t *head; /* Linked list of elements */ list_ele_t *tail; size_t size; struct list_head list; } queue_t; ``` ```graphviz digraph main{ rankdir = "LR"; node [shape = "box"]; subgraph cluster_queue_t { style = invis; queue_t_tit [label = "queue_t ( head )" shape = "none" width = 0.2 height = 0.1]; queue_t_dis [label = "{{list_ele_t *|list_ele_t *|size_t |struct list_head}|{head|tail|size|list}}" shape = "record"]; } subgraph cluster_list_ele_t { style = invis; list_ele_t_tit [label = "list_ele_t ( node )" shape = "none" width = 0.2 height = 0.1]; list_ele_t_dis [label = "{{char *|struct __element|struct list_head}|{value|next|list}}" shape = "record"]; } } ``` ### 函式說明 ### 1. get_middle ```cpp= static list_ele_t *get_middle(struct list_head *list) { struct list_head *fast = list->next, *slow; list_for_each (slow, list) { if (fast->next == list || fast->next->next == list) break; fast = fast->next->next; } return list_entry(slow, list_ele_t, list); } ``` 對於如何得到環形鏈結串列的中間節點，重點在於第 5 行的判斷。這其實很好理解，和我之前看的 YouTube: [If Programming Was An Anime](https://youtu.be/pKO9UjSeLew?t=142) 有關 (雖然裡面主要講的是 [floyd's tortoise and hare](https://en.wikipedia.org/wiki/Cycle_detection)，但對可以讓利用速度快慢所造成的差異性理解更清楚)。 * 只要每次 `fast` 是 `slow` 的兩倍走訪節點數，就能確保在達成終止條件 `if (fast->next == list || fast->next->next == list)` 時 `slow` 在中間節點。 * **總結點為奇數** ```graphviz digraph main{ layout = "circo"; head [shape = "doublecircle"]; node [shape = "circle"]; head -> N0 -> N1 -> "slow(N2)" -> N3 -> "fast(N4)" -> N5 -> head; "fast(N4)" [ style = filled, fillcolor = yellow2]; "slow(N2)" [ style = filled, fillcolor = royalblue1, color = yellow2]; N0 [shape = "doublecircle", color = springgreen4]; N1 [color = royalblue1]; } ``` * **總結點為偶數** ```graphviz digraph main{ layout = "circo"; head [shape = "doublecircle"]; node [shape = "circle"]; head -> N0 -> N1 -> "slow(N2)" -> N3 -> "fast(N4)" -> head; "fast(N4)" [ style = filled, fillcolor = yellow2]; "slow(N2)" [ style = filled, fillcolor = royalblue1, color = yellow2]; N0 [shape = "doublecircle", color = springgreen4]; N1 [color = royalblue1]; } ``` ### 2. list_merge ```cpp static void list_merge(struct list_head *lhs, struct list_head *rhs, struct list_head *head) { INIT_LIST_HEAD(head); if (list_empty(lhs)) { list_splice_tail(lhs, head); return; } if (list_empty(rhs)) { list_splice_tail(rhs, head); return; } while (!list_empty(lhs) && !list_empty(rhs)) { char *lv = list_entry(lhs->next, list_ele_t, list)->value; char *rv = list_entry(rhs->next, list_ele_t, list)->value; struct list_head *tmp = strcmp(lv, rv) <= 0 ? lhs->next : rhs->next; list_del(tmp); list_add_tail(tmp, head); } list_splice_tail(list_empty(lhs) ? rhs : lhs, head); } ``` 此函式為 merge sort 重新連接的步驟，函式輸入 lhs (left), rhs (right), head (result) 。 * 當 lhs (left) 或 rhs (right) 其中一方為 emtpy 時，直接把另一方連結到 head 尾端。 * 否則就每次雙方依序比較節點大小，把較小的值儲存到 tmp 並刪除其在原先位置，隨後連接到 head 尾端，並重複此步驟直到兩者皆為 empty 。 ### 3. list_merge_sort ```cpp void list_merge_sort(queue_t *q) { if (list_is_singular(&q->list)) return; queue_t left; struct list_head sorted; INIT_LIST_HEAD(&left.list); list_cut_position(&left.list, &q->list, &get_middle(&q->list)->list); list_merge_sort(&left); list_merge_sort(q); list_merge(&left.list, &q->list, &sorted); INIT_LIST_HEAD(&q->list); list_splice_tail(&sorted, &q->list); } ``` 此函式為 merge sort 主要遞迴程式。 * 先以環形鏈結串列的中間截斷作為分節點，以 `list_cut_position` 把 `q` 的左半部(前半段)節點移動到 `left` 上。 * `left` 和 `q` 分別進行原函式遞迴，之後再利用 `list_merge` 合併到 `sorted`。 * 最後初始化 `q` ，把排序完畢的鏈結串列存回去 `q` 。 :::info Reference * [YouTube: Joma Tech - If Programming Was An Anime](https://youtu.be/pKO9UjSeLew?t=142) * [Wikipedia: Cycle detection](https://en.wikipedia.org/wiki/Cycle_detection) ::: ### 改良結構與連結方式我認為結構可以改成： * node ```cpp typedef struct __element { char *value; struct list_head list; } list_ele_t; ``` * head ```cpp typedef struct { size_t size; struct list_head list; } queue_t; ``` 把 `head` 和 `node` 的結構所包含的資訊分開，讓兩者持有自己類別的資訊。 ```graphviz digraph main{ rankdir = "LR"; node [shape = "box"]; subgraph cluster_queue_t { style = invis; queue_t_tit [label = "queue_t ( head )" shape = "none" width = 0.2 height = 0.1]; queue_t_dis [label = "{{size_t |struct list_head}|{size|list}}" shape = "record"]; } subgraph cluster_list_ele_t { style = invis; list_ele_t_tit [label = "list_ele_t ( node )" shape = "none" width = 0.2 height = 0.1]; list_ele_t_dis [label = "{{char *|struct list_head}|{value|list}}" shape = "record"]; } } ``` 因此關於相關的函式操作需更改成下列形式： ### 4. q_new ```cpp static queue_t *q_new() { queue_t *q = malloc(sizeof(queue_t)); if (!q) return NULL; q->size = 0; INIT_LIST_HEAD(&q->list); return q; } ``` ### 5. q_free ```cpp static void q_free(queue_t *q) { if (!q) return; struct list_head *item, *is = NULL; // list_for_each_safe for (item = q->list.next, is = item->next; item != &q->list; item = is, is = item->next) { list_ele_t *temp = list_entry(item, list_ele_t, list); list_del(&temp->list); free(temp->value); free(temp); } free(q); } ``` ### 6. q_insert_head ```cpp bool q_insert_head(queue_t *q, char *s) { if (!q) return false; list_ele_t *new_node = malloc(sizeof(list_ele_t)); if (!new_node) return false; new_node->value = strdup(s); if (!new_node->value) { free(new_node); return false; } q->size++; list_add_tail(&new_node->list, &q->list); return true; } ``` * 改之前 ```shell $ gcc -o test test.c -g $ valgrind --tool=memcheck --track-origins=yes ./test ==2384== error calling PR_SET_PTRACER, vgdb might block ==2384== ==2384== HEAP SUMMARY: ==2384== in use at exit: 0 bytes in 0 blocks ==2384== total heap usage: 163,245 allocs, 163,245 frees, 4,598,645 bytes allocated ==2384== ==2384== All heap blocks were freed -- no leaks are possible ==2384== ==2384== For lists of detected and suppressed errors, rerun with: -s ==2384== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0) ``` >**4,598,645 bytes allocated** * 改之後 ```shell $ gcc -o merge merge_sort.c -g $ valgrind --tool=memcheck --track-origins=yes ./merge ==2339== error calling PR_SET_PTRACER, vgdb might block ==2339== ==2339== HEAP SUMMARY: ==2339== in use at exit: 0 bytes in 0 blocks ==2339== total heap usage: 163,245 allocs, 163,245 frees, 3,945,661 bytes allocated ==2339== ==2339== All heap blocks were freed -- no leaks are possible ==2339== ==2339== For lists of detected and suppressed errors, rerun with: -s ==2339== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0) ``` >**3,945,661 bytes allocated** --- ## 2 Linux 核心 - lib/list_sort.c ### 實作可單獨執行 (standalone) 的使用層級應用程式 * #### list_sort.c > 完整實作可見: [main.c](https://github.com/linD026/linux2021_homeowork_quiz2/blob/main/main.c) ```cpp #include <assert.h> #include <stdio.h> #include <stdlib.h> #include "list.h" #include "list_sort.h" #define len 50 u16 values[len]; int main(int argc, char **argv) { // create struct list_head testlist; arrangement order = ascend; struct listitem *item, *is = NULL; size_t i; u16 seed = (uintptr_t)*argv; random_array(values, (u16)ARRAY_SIZE(values), seed); INIT_LIST_HEAD(&testlist); for (i = 0;i < ARRAY_SIZE(values);i++) { item = (struct listitem*)malloc(sizeof(struct listitem)); item->i = values[i]; list_add(&item->list, &testlist); } // sort list_sort(&order, &testlist, cmpu16); if(check(&testlist, &order, cmpu16)) printf("in order\n"); else printf("not in order\n"); print_check(&testlist); // free return 0; } ``` * #### list.h > 完整實作可見: [list.h](https://github.com/linD026/linux2021_homeowork_quiz2/blob/main/list.h) ```cpp #ifndef _LINUX_LIST_H #define _LINUX_LIST_H #define u8 uint8_t #define u16 uint16_t #define likely(x) __builtin_expect(!!(x), 1) #define unlikely(x) __builtin_expect((x), 0) #define ARRAY_SIZE(x) (sizeof(x) / sizeof(x[0])) struct list_head { struct list_head *next, *prev; }; struct listitem { u16 i; struct list_head list; }; typedef enum { ascend, descend } arrangement; int cmp(void *arrange, const struct list_head *a, const struct list_head *b) { const u16 ai = (const u16 )container_of(a, struct listitem, list)->i; const u16 bi = (const u16 )container_of(b, struct listitem, list)->i; const arrangement *arr_temp = (arrangement *)arrange; if (*arr_temp == descend) return bi - ai; else return ai - bi; } int check(struct list_head *head, void *piv, int (*cmp_func)(void *, const struct list_head *, const struct list_head *) ){ struct list_head *item, *is = NULL; u16 det = true; const struct list_head *value = NULL; list_for_each_safe(item, is, head){ if (det) { value = item; det = false; } else { if (cmp_func(piv, value, item) > 0 && *(arrangement*)piv == ascend) return false; value = item; } } return true; } #endif ``` ```shell $ gcc -o test list_sort.c -g $ valgrind --tool=memcheck --track-origins=yes ./test ==7455== error calling PR_SET_PTRACER, vgdb might block in order 2 8 12 27 40 42 64 96 102 121 144 192 198 223 264 283 332 356 390 501 ==7455== ==7455== HEAP SUMMARY: ==7455== in use at exit: 480 bytes in 20 blocks ==7455== total heap usage: 21 allocs, 1 frees, 4,576 bytes allocated ==7455== ==7455== LEAK SUMMARY: ==7455== definitely lost: 24 bytes in 1 blocks ==7455== indirectly lost: 456 bytes in 19 blocks ==7455== possibly lost: 0 bytes in 0 blocks ==7455== still reachable: 0 bytes in 0 blocks ==7455== suppressed: 0 bytes in 0 blocks ==7455== Rerun with --leak-check=full to see details of leaked memory ==7455== ==7455== For lists of detected and suppressed errors, rerun with: -s ==7455== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0) ``` :::info Reference * [stackoverflow - Is __attribute__((nonnull)) is standard C](https://stackoverflow.com/questions/45237148/is-attribute-nonnull-is-standard-c) * [arm keil - __attribute__((nonnull))) function attribute](https://www.keil.com/support/man/docs/armcc/armcc_chr1359124976631.htm) * [bootlin.com - linux](https://elixir.bootlin.com/linux/v5.11.3/source/drivers) * [likely, unlikely 巨集與 GCC 內建函式 `__builtin_expect`](http://stenlyho.blogspot.com/2007/05/likelyunlikelygccbuiltinexpect.html) * Linux 原始程式碼 (GitHub) [include/linux/list.h](https://github.com/torvalds/linux/blob/master/include/linux/list.h#L505) ::: ### 設計檢測 * analysis.c > 完整實作可見: [analysis.c](https://github.com/linD026/linux2021_homeowork_quiz2/blob/main/analysis.c) ```cpp void analysis(void) { FILE *ptr = NULL; ptr = fopen("list_sort.txt", "w"); if(!ptr) return; printf("len:%d time:%d\n", len, times); arrangement order = ascend; struct list_head testlist; INIT_LIST_HEAD(&testlist); struct listitem *item, *is = NULL; size_t i; for (i = 0;i < len;i++) { item = (struct listitem*)malloc(sizeof(struct listitem)); list_add(&item->list, &testlist); } struct timespec time_start; struct timespec time_end; double during; for (int time_i = 0;time_i < times;time_i++) { //printf("%d\n", time_i); list_for_each_entry_safe(item, is, &testlist, list) { item->i = set_rand(); } clock_gettime(CLOCK_MONOTONIC, &time_start); list_sort(&order, &testlist, cmpu16); clock_gettime(CLOCK_MONOTONIC, &time_end); during = time_diff(time_start, time_end); if(check(&testlist, &order, cmpu16)) { printf("%d check\n", time_i); fprintf(ptr, "%d %f\n" , time_i, during); } } fclose(ptr); } ``` ![](https://imgur.com/1CeLUoJ.png) ### 說明 lib/list_sort.c 最佳化手法 temp note : * [/lib/list_sort.c](https://github.com/torvalds/linux/blob/master/lib/list_sort.c) ```cpp=132 /* * This mergesort is as eager as possible while always performing at least * 2:1 balanced merges. Given two pending sublists of size 2^k, they are * merged to a size-2^(k+1) list as soon as we have 2^k following elements. * * Thus, it will avoid cache thrashing as long as 3*2^k elements can * fit into the cache. Not quite as good as a fully-eager bottom-up * mergesort, but it does use 0.2*n fewer comparisons, so is faster in * the common case that everything fits into L1. */ ``` 根據註解此 mergesort 會盡可能地以 2:1 合併，因此當有兩個 $2^k$ 大小的串列會合併成 $2^{k+1}$ 大小的串列來達到避免 cache thrashing 直到 $3 \times 2^k$ 個元素至 cache 。再者雖然 mergesort 沒有 fully-eager bottom-up mergesort 好，但用較少的比較 ( $0.2 \times n$ )，因此只要在一般情況下跟 L1 合適就比較快。 > L1 為 CPU 的 cache 。可以下 `getconf -a | grep CACHE` 查看。 * #### [list_sort( )](https://github.com/torvalds/linux/blob/master/lib/list_sort.c) ```cpp=142 /* * The merging is controlled by "count", the number of elements in the * pending lists. This is beautiully simple code, but rather subtle. * * Each time we increment "count", we set one bit (bit k) and clear * bits k-1 .. 0. Each time this happens (except the very first time * for each bit, when count increments to 2^k), we merge two lists of * size 2^k into one list of size 2^(k+1). * * This merge happens exactly when the count reaches an odd multiple of * 2^k, which is when we have 2^k elements pending in smaller lists, * so it's safe to merge away two lists of size 2^k. * * After this happens twice, we have created two lists of size 2^(k+1), * which will be merged into a list of size 2^(k+2) before we create * a third list of size 2^(k+1), so there are never more than two pending. */ ``` >解釋分離與合併時的數量是如何符合最佳化。 ```cpp=158 /* The number of pending lists of size 2^k is determined by the * state of bit k of "count" plus two extra pieces of information: * * - The state of bit k-1 (when k == 0, consider bit -1 always set), and * - Whether the higher-order bits are zero or non-zero (i.e. * is count >= 2^(k+1)). * * There are six states we distinguish. "x" represents some arbitrary * bits, and "y" represents some arbitrary non-zero bits: * 0: 00x: 0 pending of size 2^k; x pending of sizes < 2^k * 1: 01x: 0 pending of size 2^k; 2^(k-1) + x pending of sizes < 2^k * 2: x10x: 0 pending of size 2^k; 2^k + x pending of sizes < 2^k * 3: x11x: 1 pending of size 2^k; 2^(k-1) + x pending of sizes < 2^k * 4: y00x: 1 pending of size 2^k; 2^k + x pending of sizes < 2^k * 5: y01x: 2 pending of size 2^k; 2^(k-1) + x pending of sizes < 2^k * (merge and loop back to state 2) */ ``` >說明對於數量是如掌控與操作。 ```cpp=191 { struct list_head *list = head->next, *pending = NULL; size_t count = 0; /* Count of pending */ if (list == head->prev) /* Zero or one elements */ return; /* Convert to a null-terminated singly-linked list. */ head->prev->next = NULL; /* * Data structure invariants: * - All lists are singly linked and null-terminated; prev * pointers are not maintained. * - pending is a prev-linked "list of lists" of sorted * sublists awaiting further merging. * - Each of the sorted sublists is power-of-two in size. * - Sublists are sorted by size and age, smallest & newest at front. * - There are zero to two sublists of each size. * - A pair of pending sublists are merged as soon as the number * of following pending elements equals their size (i.e. * each time count reaches an odd multiple of that size). * That ensures each later final merge will be at worst 2:1. * - Each round consists of: * - Merging the two sublists selected by the highest bit * which flips when count is incremented, and * - Adding an element from the input as a size-1 sublist. */ do { size_t bits; struct list_head **tail = &pending; /* Find the least-significant clear bit in count */ for (bits = count; bits & 1; bits >>= 1) tail = &(*tail)->prev; /* Do the indicated merge */ if (likely(bits)) { struct list_head *a = *tail, *b = a->prev; a = merge(priv, (cmp_func)cmp, b, a); /* Install the merged result in place of the inputs */ a->prev = b->prev; *tail = a; } /* Move one element from input list to pending */ list->prev = pending; pending = list; list = list->next; pending->next = NULL; count++; } while (list); ``` ```graphviz digraph main{ rankdir = LR; node [shape = box]; " " [shape = "none"]; list [label = "list", shape = "none"]; {rank = "same";list -> "pending(node 1)";} "pending(node 1)" -> " "; thr[label = "through" shape = "none"] " " -> thr[style = "invis"]; subgraph cluster_p{ label = tail; "pending(node 1)" } head -> "pending(node 1)"; } ``` * **count 為 1** ( least-significant clear bit ) ```graphviz digraph main{ bitlist [label = "...|0|0|0|1", shape = record]; } ``` > 會讓 tail 往右側移一位到 head 的位置。 > > 關於 bit 如何操作請看 [第 158 行](#list_sort-)。 * 下圖為 [list_sort( )](#list_sort-) 192 到 242 行圖解以 count 為 6 時的示意圖。 ```graphviz digraph main { graph [ rankdir = LR] node [shape = box] compound = true; subgraph cluster_merge{ label = "＜== merge side"; NULL subgraph cluster_pending{ label = "＜== append"; "...", "node 1", b; subgraph cluster_tail{ label = "tail" "pending(a)" } } subgraph cluster_list{ label = "linked" "list end", "list", "... " } } head -> "node 1"[label = "next"]; "node 1" -> "NULL "[label = "next"]; "node 1" -> " NULL "[label = "prev"]; "..." -> "node 1"[label = "prev"]; "pending(a)" -> b[label = "prev"]; b -> "..."[label = "prev"]; "list" -> "... "[label = "next"]; "... " -> "list"[label = "prev"]; "list end" -> "... "[label = "prev"] "... " -> "list end"[label = "next"]; "list" -> "pending(a)"[label = "prev"]; head -> "list end"[label = "prev"]; "list end"-> NULL[label = "next"]; {"list end", NULL}; } ``` 假設在 `list_sort()` 裡的 b 先排，則 tail 會在 a 上 : ```graphviz digraph main { node [shape = box]; rankdir = LR subgraph cluster_pending{ label = "＜== append"; "..." "node 1" subgraph cluster_tail{ label = "tail"; a } } list->a->"..."->"node 1"; b->a[label = "prev" constraint=false]; b->a[label = "next" constraint=false]; } ``` ```cpp=244 /* End of input; merge together all the pending lists. */ list = pending; pending = pending->prev; for (;;) { struct list_head *next = pending->prev; if (!next) break; list = merge(priv, (cmp_func)cmp, pending, list); pending = next; } /* The final merge, rebuilding prev links */ merge_final(priv, (cmp_func)cmp, head, pending, list); ``` 最後再把剩下的 pending 合併排序即可。 :::info Reference * [blog.csdn.net - 为什么CPU缓存会分为L1、L2、L3？](https://blog.csdn.net/dianhuizhan3102/article/details/101859331?utm_medium=distribute.pc_relevant_t0.none-task-blog-BlogCommendFromBaidu-1.control&dist_request_id=&depth_1-utm_source=distribute.pc_relevant_t0.none-task-blog-BlogCommendFromBaidu-1.control) * [ibm.com - Cache size limits (JVM)](https://www.ibm.com/support/knowledgecenter/SSYKE2_8.0.0/com.ibm.java.vm.80.doc/docs/shrc_cache_size.html) 額外閱讀 * [researchgate.net - Parallel Merge Sort with Load Balancing](https://www.researchgate.net/publication/220091378_Parallel_Merge_Sort_with_Load_Balancing) ::: ## 3 效能比較 tree sort vs. merge sort * 以 10000 節點與 2000 次的重複執行，紀錄 tree-sort.c ( quiz1 ) 和 list_sort.c ( Linux ) 的動態記憶體配置與靜態 stack 所站的空間。 * 兩個程式排序的數值皆為 `uint16_t` 單位，且由以下程式碼給出： ```cpp static inline int self_random(int seed_f, int seed_s) { int sn_1 = 0; int sn_2 = 1; int sn = random() % 1024; int i = seed_f; while (i > 0) { sn = (sn_2 & sn_1) % (seed_s + 1); sn_1 = sn_1 + 3; sn_2 = sn_2 + 7; i--; } return sn; } uint16_t set_rand(void) { time_t current_time; srandom(time(&current_time)); return self_random(random() % 1024, random() % 1024); } ``` * 第一筆資料是以 gcc 裡的 `-fstack-usage` 得到，中間數字為 stack usage (單位: Byte)。 * 第二筆資料則是 `valgrind -v --leak-check=full` 。 * 第三筆為靜態資料，利用命令 `size` 。 * 第四筆則為 `gprof` 。 **關於完整 `analysis.txt` 資料請點擊函式標題。** * [tree-sort.c](https://github.com/linD026/linux2021_homeowork_quiz2/blob/main/analysis_tree_sort.txt) ```shell $ gcc -o test -I./include -I./private tests/tree-sort.c -g -fstack-usage list.h:83:20:INIT_LIST_HEAD 16 static list.h:109:20:list_add_tail 16 static list.h:135:20:list_del 16 static rbtree.h:69:20:rb_link_node 16 static rbtree.h:154:17:rb_first 16 static rbtree.h:166:17:rb_next 16 static rb_tree_augmented.h:115:20:rb_set_parent_color 16 static rb_tree_augmented.h:122:20:__rb_change_child 16 static rb_tree_augmented.h:305:31:rb_red_parent 16 static rb_tree_augmented.h:316:1:__rb_rotate_set_parents 64 static rb_tree_augmented.h:463:20:dummy_rotate 16 static rb_tree_augmented.h:466:6:rb_insert_color 80 static common.h:16:23:getnum 16 static common.h:34:17:get_unsigned16 32 static tree-sort.c:17:13:insert 80 static tree-sort.c:34:13:tree_sort 192 dynamic tree-sort.c:133:19:self_random 48 static tree-sort.c:147:10:set_rand 48 static tree-sort.c:156:6:analysis 160 static tree-sort.c:189:5:main 32 static $ valgrind -v --leak-check=full ./test --1341-- REDIR: 0x48f6850 (libc.so.6:free) redirected to 0x483c9d0 (free) ==1341== ==1341== HEAP SUMMARY: ==1341== in use at exit: 240,000 bytes in 10,000 blocks ==1341== total heap usage: 10,003 allocs, 3 frees, 248,664 bytes allocated ==1341== ==1341== Searching for pointers to 10,000 not-freed blocks ==1341== Checked 74,168 bytes $ size test text data bss dec hex filename 6866 678 2 7546 1d7a test $ gcc -pg -o tree_sort -I./include -I./private tests/tree-sort.c -g -fstack-usage $ ./tree_sort $ gprof tree_sort gmon.out > analysis.txt Each sample counts as 0.01 seconds. no time accumulated % cumulative self self total time seconds seconds calls Ts/call Ts/call name 0.00 0.00 0.00 89960398 0.00 0.00 rb_set_parent_color 0.00 0.00 0.00 59893329 0.00 0.00 rb_red_parent 0.00 0.00 0.00 20089752 0.00 0.00 dummy_rotate 0.00 0.00 0.00 20010000 0.00 0.00 list_add_tail 0.00 0.00 0.00 20002001 0.00 0.00 INIT_LIST_HEAD 0.00 0.00 0.00 20000000 0.00 0.00 insert 0.00 0.00 0.00 20000000 0.00 0.00 list_del 0.00 0.00 0.00 20000000 0.00 0.00 rb_insert_color 0.00 0.00 0.00 20000000 0.00 0.00 rb_link_node 0.00 0.00 0.00 20000000 0.00 0.00 rb_next 0.00 0.00 0.00 20000000 0.00 0.00 self_random 0.00 0.00 0.00 20000000 0.00 0.00 set_rand 0.00 0.00 0.00 19951723 0.00 0.00 __rb_change_child 0.00 0.00 0.00 19951723 0.00 0.00 __rb_rotate_set_parents 0.00 0.00 0.00 2000 0.00 0.00 rb_first 0.00 0.00 0.00 2000 0.00 0.00 tree_sort 0.00 0.00 0.00 1 0.00 0.00 analysis ``` * [list_sort.c ( + analysis.c )](https://github.com/linD026/linux2021_homeowork_quiz2/blob/main/analysis_list_sort.txt) ```shell $ gcc -o test list_sort.c analysis.c -g -fstack-usage list_sort.c:17:1:merge 80 static list_sort.c:44:1:merge_final 80 static list_sort.c:176:1:list_sort 128 static list.h:37:20:INIT_LIST_HEAD 16 static list.h:42:20:list_add 16 static list.h:109:19:cmpu16 16 static list.h:122:19:self_random 48 static analysis.c:10:5:set_rand 48 static analysis.c:18:6:analysis 160 static analysis.c:54:5:main 32 static $ valgrind -v --leak-check=full ./test --377-- REDIR: 0x48f6850 (libc.so.6:free) redirected to 0x483c9d0 (free) ==377== ==377== HEAP SUMMARY: ==377== in use at exit: 240,000 bytes in 10,000 blocks ==377== total heap usage: 10,003 allocs, 3 frees, 248,664 bytes allocated ==377== ==377== Searching for pointers to 10,000 not-freed blocks ==377== Checked 74,152 bytes $ size test text data bss dec hex filename 4496 656 8 5160 1428 test $ gcc -pg -o list_sort list_sort.c analysis.c -g -fstack-usage $ ./list_sort $ gprof list_sort gmon.out > analysis.txt Each sample counts as 0.01 seconds. no time accumulated % cumulative self self total time seconds seconds calls Ts/call Ts/call name 0.00 0.00 0.00 130915357 0.00 0.00 cmpu16 0.00 0.00 0.00 20000000 0.00 0.00 self_random 0.00 0.00 0.00 20000000 0.00 0.00 set_rand 0.00 0.00 0.00 19996000 0.00 0.00 merge 0.00 0.00 0.00 10000 0.00 0.00 list_add 0.00 0.00 0.00 2000 0.00 0.00 list_sort 0.00 0.00 0.00 2000 0.00 0.00 merge_final 0.00 0.00 0.00 1 0.00 0.00 INIT_LIST_HEAD 0.00 0.00 0.00 1 0.00 0.00 analysis ``` | | tree-sort.c | list_sort.c | | ----------------:|:---------------------------------------------------:|:---------------------------------------------------:| | total heap usage | 10,003 allocs 3 frees 248,664 bytes allocated | 10,003 allocs 3 frees 248,664 bytes allocated | | main function | 192 dynamic (tree_sort) | 128 static (list_sort) | ![](https://imgur.com/cdzYDCx.png) ![](https://imgur.com/6lwQewV.png) ![](https://imgur.com/3SOnC4q.png) 可以看出雖然 list-sort `cmpu16` 高達 130915357 比其他函式呼叫的多，但在從時間花費上可以看出其所造成的成本並不比 tree sort 其他函式呼叫累積起來的成本高。然而上述資料所呈現的記憶體使用量沒有執行時期的紀錄，因此我開一個執行緒利用 `#include <sys/resource.h>` 裡的 `struct rusage` 來記錄每 1ms 記憶體的使用量，但這次節點數為 10000 、次數 100 次： ![](https://imgur.com/z8YWrEI.png) ![](https://imgur.com/SWalqkg.png) * 詳細程式碼請看 [`dtusage.h`](https://github.com/linD026/linux2021_homeowork_quiz2/blob/thread_p/dtusage.h) 。 * 可以明顯看出 tree sort 的記憶體使用量確實較大，這點我認為是因為 tree sort 在排序時會需要建立 BST 在與 list-sort ( merge sort )相比產生較多的額外記憶體空間。 * 經老師提出： > 為何不用 valgrind --tool=massif 呢？沒想到 valgrind 有這功能因此試了一下，只貼上記憶體圖形化與 stack 使用率較為突出的部分： > 註：tree-sort.c 有較高的 stack 使用量，然而`list-sort.c` 除了 n = 1 比較高外都皆為 480 。 > > 關於此問題請看 [n = 1 stack 使用量大問題](#n--1-stack-使用量大問題)。 * tree-sort.c ```shell KB 712.4^ : @ @ |#:: ::: : @ ::@ : |#:: : : : @ ::@ : |#:: : : : @ ::@ : |#:: : : : @ ::@ : |#:: : : : @ ::@ : |#:: : : : @ ::@ : |#:: : : : @ ::@ : |#:: : : : @ ::@ : |#:::::::::::::: @:::@::::::::::::@:::::::@:::@@:::@:::::@::::::@::::::@: |#:::::: ::::::: @:::@::::::: ::::@:::::::@:::@ :::@:::::@::::::@::::::@: |#:::::: ::::::: @:::@::::::: ::::@:::::::@:::@ :::@:::::@::::::@::::::@: |#:::::: ::::::: @:::@::::::: ::::@:::::::@:::@ :::@:::::@::::::@::::::@: |#:::::: ::::::: @:::@::::::: ::::@:::::::@:::@ :::@:::::@::::::@::::::@: |#:::::: ::::::: @:::@::::::: ::::@:::::::@:::@ :::@:::::@::::::@::::::@: |#:::::: ::::::: @:::@::::::: ::::@:::::::@:::@ :::@:::::@::::::@::::::@: |#:::::: ::::::: @:::@::::::: ::::@:::::::@:::@ :::@:::::@::::::@::::::@: |#:::::: ::::::: @:::@::::::: ::::@:::::::@:::@ :::@:::::@::::::@::::::@: |#:::::: ::::::: @:::@::::::: ::::@:::::::@:::@ :::@:::::@::::::@::::::@: |#:::::: ::::::: @:::@::::::: ::::@:::::::@:::@ :::@:::::@::::::@::::::@: 0 +----------------------------------------------------------------------->Gi 0 12.54 Number of snapshots: 84 Detailed snapshots: [1 (peak), 16, 20, 33, 42, 46, 50, 60, 70, 80] -------------------------------------------------------------------------------- n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B) -------------------------------------------------------------------------------- 0 0 0 0 0 0 1 140,839,206 725,216 244,568 160,024 320,624 33.72% (244,568B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc. ->33.09% (240,000B) 0x109F13: analysis (tree-sort.c:168) | ->33.09% (240,000B) 0x10A0C2: main (tree-sort.c:196) | ->00.63% (4,568B) in 1+ places, all below ms_print's threshold (01.00%) -------------------------------------------------------------------------------- n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B) -------------------------------------------------------------------------------- 2 293,986,963 729,392 248,664 160,032 320,696 3 442,401,488 729,392 248,664 160,032 320,696 4 605,555,715 409,176 248,664 160,032 480 5 770,853,878 409,176 248,664 160,032 480 6 1,049,488,316 409,176 248,664 160,032 480 7 1,228,610,437 409,176 248,664 160,032 480 8 1,504,261,304 409,176 248,664 160,032 480 9 1,779,934,462 409,176 248,664 160,032 480 10 1,926,957,892 409,176 248,664 160,032 480 11 2,197,708,686 409,176 248,664 160,032 480 12 2,398,560,278 409,176 248,664 160,032 480 13 2,549,199,492 409,176 248,664 160,032 480 14 2,750,050,755 729,392 248,664 160,032 320,696 15 3,026,222,400 729,472 248,664 160,032 320,776 16 3,176,865,118 409,176 248,664 160,032 480 60.77% (248,664B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc. ->58.65% (240,000B) 0x109F13: analysis (tree-sort.c:168) | ->58.65% (240,000B) 0x10A0C2: main (tree-sort.c:196) -------------------------------------------------------------------------------- n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B) -------------------------------------------------------------------------------- 61 10,728,706,661 409,168 248,664 160,032 472 62 10,852,877,952 409,176 248,664 160,032 480 63 10,977,050,052 409,176 248,664 160,032 480 64 11,101,221,318 409,176 248,664 160,032 480 65 11,225,393,418 409,176 248,664 160,032 480 66 11,349,558,940 409,168 248,664 160,032 472 67 11,473,724,463 729,312 248,664 160,032 320,616 68 11,597,890,448 409,176 248,664 160,032 480 69 11,722,055,973 729,400 248,664 160,032 320,704 70 11,846,221,495 729,472 248,664 160,032 320,776 34.09% (248,664B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc. ->32.90% (240,000B) 0x109F13: analysis (tree-sort.c:168) | ->32.90% (240,000B) 0x10A0C2: main (tree-sort.c:196) ``` * list-sort.c (+ analysis.c ) ```shell KB 401.5^# |#:::::::::::::::::::::::@::@::::::::::::@:::::::@:::::@::::@:::::@:::::@ |#:::::::::::::::::::::::@: @::::::::::::@:::::::@:::::@::::@:::::@:::::@ |#:::::::::::::::::::::::@: @::::::::::::@:::::::@:::::@::::@:::::@:::::@ |#:::::::::::::::::::::::@: @::::::::::::@:::::::@:::::@::::@:::::@:::::@ |#:::::::::::::::::::::::@: @::::::::::::@:::::::@:::::@::::@:::::@:::::@ |#:::::::::::::::::::::::@: @::::::::::::@:::::::@:::::@::::@:::::@:::::@ |#:::::::::::::::::::::::@: @::::::::::::@:::::::@:::::@::::@:::::@:::::@ |#:::::::::::::::::::::::@: @::::::::::::@:::::::@:::::@::::@:::::@:::::@ |#:::::::::::::::::::::::@: @::::::::::::@:::::::@:::::@::::@:::::@:::::@ |#:::::::::::::::::::::::@: @::::::::::::@:::::::@:::::@::::@:::::@:::::@ |#:::::::::::::::::::::::@: @::::::::::::@:::::::@:::::@::::@:::::@:::::@ |#:::::::::::::::::::::::@: @::::::::::::@:::::::@:::::@::::@:::::@:::::@ |#:::::::::::::::::::::::@: @::::::::::::@:::::::@:::::@::::@:::::@:::::@ |#:::::::::::::::::::::::@: @::::::::::::@:::::::@:::::@::::@:::::@:::::@ |#:::::::::::::::::::::::@: @::::::::::::@:::::::@:::::@::::@:::::@:::::@ |#:::::::::::::::::::::::@: @::::::::::::@:::::::@:::::@::::@:::::@:::::@ |#:::::::::::::::::::::::@: @::::::::::::@:::::::@:::::@::::@:::::@:::::@ |#:::::::::::::::::::::::@: @::::::::::::@:::::::@:::::@::::@:::::@:::::@ |#:::::::::::::::::::::::@: @::::::::::::@:::::::@:::::@::::@:::::@:::::@ 0 +----------------------------------------------------------------------->Gi 0 12.62 Number of snapshots: 97 Detailed snapshots: [1 (peak), 28, 30, 44, 55, 65, 75, 85, 95] -------------------------------------------------------------------------------- n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B) -------------------------------------------------------------------------------- 0 0 0 0 0 0 1 172,168,618 411,160 248,664 160,032 2,464 60.48% (248,664B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc. ->58.37% (240,000B) 0x109868: analysis (analysis.c:30) | ->58.37% (240,000B) 0x109A23: main (analysis.c:62) -------------------------------------------------------------------------------- n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B) -------------------------------------------------------------------------------- 96 13,553,440,562 409,176 248,664 160,032 480 ``` #### n = 1 stack 使用量大問題經反覆測試與檢測初始化所造成的成本發現， n = 1 會提高的原因是因為 analysis.c 的 linked list 是先配置完記憶體後，進入數測量次數的迴圈再走訪一次設定亂數。會這麼做的原因是因為程式比較好寫，不須寫例外狀況(次數為一)。因此第一次會因為呼叫 `list_add` 與設定鑑測資料(如時間)而導致 n = 1成本變高。以下是函式簡化後： - [ ] analysis.c (簡化) ```cpp #define len 1000000 #define times 1 int main(void) { printf("len:%d time:%d\n", len, times); arrangement order = ascend; struct list_head testlist; INIT_LIST_HEAD(&testlist); struct listitem *item, *is = NULL; size_t i; for (i = 0;i < len;i++) { item = (struct listitem*)malloc(sizeof(struct listitem)); list_add(&item->list, &testlist); } for (int time_i = 0;time_i < times;time_i++) { list_for_each_entry_safe(item, is, &testlist, list) { item->i = set_rand(); } list_sort(&order, &testlist, cmpu16); } } ``` ``` MB 38.15^ :::: |##::::::::::::::::::::::::::::::::::::::::@::::@::::::::::::::::@::::::: |# :: ::::::: :::: ::::: :::: :::: : : ::::@::::@::::::::::::::::@::::::: |# :: ::::::: :::: ::::: :::: :::: : : ::::@::::@::::::::::::::::@::::::: |# :: ::::::: :::: ::::: :::: :::: : : ::::@::::@::::::::::::::::@::::::: |# :: ::::::: :::: ::::: :::: :::: : : ::::@::::@::::::::::::::::@::::::: |# :: ::::::: :::: ::::: :::: :::: : : ::::@::::@::::::::::::::::@::::::: |# :: ::::::: :::: ::::: :::: :::: : : ::::@::::@::::::::::::::::@::::::: |# :: ::::::: :::: ::::: :::: :::: : : ::::@::::@::::::::::::::::@::::::: |# :: ::::::: :::: ::::: :::: :::: : : ::::@::::@::::::::::::::::@::::::: |# :: ::::::: :::: ::::: :::: :::: : : ::::@::::@::::::::::::::::@::::::: |# :: ::::::: :::: ::::: :::: :::: : : ::::@::::@::::::::::::::::@::::::: |# :: ::::::: :::: ::::: :::: :::: : : ::::@::::@::::::::::::::::@::::::: |# :: ::::::: :::: ::::: :::: :::: : : ::::@::::@::::::::::::::::@::::::: |# :: ::::::: :::: ::::: :::: :::: : : ::::@::::@::::::::::::::::@::::::: |# :: ::::::: :::: ::::: :::: :::: : : ::::@::::@::::::::::::::::@::::::: |# :: ::::::: :::: ::::: :::: :::: : : ::::@::::@::::::::::::::::@::::::: |# :: ::::::: :::: ::::: :::: :::: : : ::::@::::@::::::::::::::::@::::::: |# :: ::::::: :::: ::::: :::: :::: : : ::::@::::@::::::::::::::::@::::::: |# :: ::::::: :::: ::::: :::: :::: : : ::::@::::@::::::::::::::::@::::::: 0 +----------------------------------------------------------------------->Gi 0 10.50 -------------------------------------------------------------------------------- n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B) -------------------------------------------------------------------------------- 0 0 0 0 0 0 1 63,543,365 39,635,968 23,781,736 15,853,816 416 2 326,840,523 40,001,432 24,001,024 16,000,008 400 50 9,180,465,501 40,001,432 24,001,024 16,000,008 400 67 11,273,922,938 40,001,584 24,001,024 16,000,008 552 ``` - [ ] main.c ( 簡化 ) ```cpp #define len 1000000 u16 values[len]; int main(int argc, char **argv) { // create struct list_head testlist; arrangement order = ascend; struct listitem *item, *is = NULL; size_t i; u16 seed = (uintptr_t)*argv; INIT_LIST_HEAD(&testlist); random_u16(&testlist, len, seed); // sort list_sort(&order, &testlist, cmpu16); if(check(&testlist, &order, cmpu16)) printf("in order\n"); else printf("not in order\n"); // free return 0; } ``` ``` MB 38.15^ :: : : | @@#:::::::@:: | @@@@@#:::::::@:: | @@@@@@@@#:::::::@:: | :::@@@@@@@@#:::::::@:: | @@::::@@@@@@@@#:::::::@:: | @@@@@@::::@@@@@@@@#:::::::@:: | @@@@@@@@::::@@@@@@@@#:::::::@:: | @@@@@@@@@@@::::@@@@@@@@#:::::::@:: | ::@@@@ @@@@@@@@::::@@@@@@@@#:::::::@:: | @@:::@ @@ @@@@@@@@::::@@@@@@@@#:::::::@:: | @@@ :::@ @@ @@@@@@@@::::@@@@@@@@#:::::::@:: | @@@@@@@ :::@ @@ @@@@@@@@::::@@@@@@@@#:::::::@:: | @:@@@@ @@@ :::@ @@ @@@@@@@@::::@@@@@@@@#:::::::@:: | @@@@:@@@@ @@@ :::@ @@ @@@@@@@@::::@@@@@@@@#:::::::@:: | @@@@@@@:@@@@ @@@ :::@ @@ @@@@@@@@::::@@@@@@@@#:::::::@:: | @@:@@@@@@@:@@@@ @@@ :::@ @@ @@@@@@@@::::@@@@@@@@#:::::::@:: | @@:@@@:@@@@@@@:@@@@ @@@ :::@ @@ @@@@@@@@::::@@@@@@@@#:::::::@:: | @@@@:@@@:@@@@@@@:@@@@ @@@ :::@ @@ @@@@@@@@::::@@@@@@@@#:::::::@:: | @@@@@@@:@@@:@@@@@@@:@@@@ @@@ :::@ @@ @@@@@@@@::::@@@@@@@@#:::::::@:: 0 +----------------------------------------------------------------------->Gi 0 6.900 -------------------------------------------------------------------------------- n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B) -------------------------------------------------------------------------------- 0 0 0 0 0 0 1 142,377,837 887,104 531,984 354,656 464 2 297,677,838 1,854,024 1,112,136 741,424 464 50 5,602,886,870 34,932,744 20,959,368 13,972,912 464 77 7,409,064,283 40,000,520 24,000,000 16,000,000 520 ``` 可以看出 stack 被壓縮到先前運行中段時的成本 480 。 :::info Reference * [thegeekstuff.com - gprof-tutorial](https://www.thegeekstuff.com/2012/08/gprof-tutorial/) * [stackoverflow - How can I profile C++ code running on Linux?](https://stackoverflow.com/questions/375913/how-can-i-profile-c-code-running-on-linux/378024#378024) * [stackoverflow - Alternatives to gprof [closed]](https://stackoverflow.com/questions/1777556/alternatives-to-gprof/1779343#1779343) * [stackoverflow - How to use sem_trywait()?](https://stackoverflow.com/questions/27294954/how-to-use-sem-trywait) * [stackoverflow - How to use nanosleep() in C? What are `tim.tv_sec` and `tim.tv_nsec`?](https://stackoverflow.com/questions/7684359/how-to-use-nanosleep-in-c-what-are-tim-tv-sec-and-tim-tv-nsec) * [stackoverflow - Getting getrusage() to measure system time in C ](https://stackoverflow.com/questions/10509660/getting-getrusage-to-measure-system-time-in-c) * [stackoverflow - How can I wait for any/all pthreads to complete?](https://stackoverflow.com/questions/6154539/how-can-i-wait-for-any-all-pthreads-to-complete) * [sem_wait(3) — Linux manual page](https://man7.org/linux/man-pages/man3/sem_wait.3.html) * [getrusage(2) — Linux manual page](https://man7.org/linux/man-pages/man2/getrusage.2.html) * [pthread_create(3) — Linux manual page](https://man7.org/linux/man-pages/man3/pthread_create.3.html) * [computing.llnl.gov - POSIX Threads Programming](https://computing.llnl.gov/tutorials/pthreads/) * [valgrind.org - 9. Massif: a heap profiler](https://valgrind.org/docs/manual/ms-manual.html) ::: --- ## 測驗 `2` ## 1 解釋程式運作原理 >考慮函式 func 接受一個 16 位元無號整數 $N$，並回傳小於或等於 $N$ 的 power-of-2 (漢字可寫作為 2 的冪) ```cpp= uint16_t func(uint16_t N) { /* change all right side bits to 1 */ N |= N >> 1; N |= N >> 2; N |= N >> 4; N |= N >> 8; return (N + 1) >> 1; } ``` * example: N = $10101_2$ = $21_{10}$ * first ```graphviz digraph main{ graph [rankdir = LR]; node [shape = record]; bl_o_1 [label = "{<0>0|<1>0|<2>0|<3>0}"]; bl_o_2 [label = "{<4>0|<5>0|<6>0|<7>0}"]; bl_o_3 [label = "{<8>0|<9>0|<10>0|<11>1}"]; bl_o_4 [label = "{<12>0|<13>1|<14>0|<15>1}"]; bl_o_1->bl_o_2[style = invis]; bl_o_2->bl_o_3[style = invis]; bl_o_3->bl_o_4[style = invis]; } ``` * `N |= N >> 1;` ```graphviz digraph main{ graph [rankdir = LR]; node [shape = record]; bl_1_1 [label = "{<0>0|<1>0|<2>0|<3>0}"]; bl_1_2 [label = "{<4>0|<5>0|<6>0|<7>0}"]; bl_1_3 [label = "{<8>0|<9>0|<10>0|<11>0}"]; bl_1_4 [label = "{<12>1|<13>0|<14>1|<15>0}"]; bl_1_1->bl_1_2[style = invis]; bl_1_2->bl_1_3[style = invis]; bl_1_3->bl_1_4[style = invis]; } ``` ```graphviz digraph main{ graph [rankdir = LR]; node [shape = record]; bl_2_1 [label = "{<0>0|<1>0|<2>0|<3>0}"]; bl_2_2 [label = "{<4>0|<5>0|<6>0|<7>0}"]; bl_2_3 [label = "{<8>0|<9>0|<10>0|<11>1}"]; bl_2_4 [label = "{<12>1|<13>1|<14>1|<15>1}"]; bl_2_1->bl_2_2[style = invis]; bl_2_2->bl_2_3[style = invis]; bl_2_3->bl_2_4[style = invis]; } ``` > 先省略 4 到 5 行，因為這裡還用不到。 * `return (N + 1) >> 1;` ```graphviz digraph main{ graph [rankdir = LR]; node [shape = record]; bl_3_1 [label = "{<0>0|<1>0|<2>0|<3>0}"]; bl_3_2 [label = "{<4>0|<5>0|<6>0|<7>0}"]; bl_3_3 [label = "{<8>0|<9>0|<10>0|<11>1}"]; bl_3_4 [label = "{<12>0|<13>0|<14>0|<15>0}"]; bl_3_1->bl_3_2[style = invis]; bl_3_2->bl_3_3[style = invis]; bl_3_3->bl_3_4[style = invis]; bl_3_3:10 -> bl_3_3:11; #bl_o_1 [label = "{<0>|<1>|<2>|<3>}"]; #bl_o_2 [label = "{<4>|<5>|<6>|<7>}"]; #bl_o_3 [label = "{<8>|<9>|<10>|<11>}"]; #bl_o_4 [label = "{<12>|<13>|<14>|<15>}"]; } ``` * 最後會回傳 **16 (0001_0000)** 。 * 可以理解成在進行 3 到 6 行的右移是為了保證最高是 1 的位元以下(以後)都是 1 。 * 之後對其進行加一進位，隨後在往右移一位即得小於或等於輸入數值的 power-of-2。 ### C11 6.3 Conversions **但這就有問題了**，如果是 65535 (1111_1111_1111_1111) 預期結果會是什麼？如果是以平常寫程式的經驗推論，會是在 3 到 6 行都是與輸入一樣後，加 1 使其 overflow 為 [1]_0000_0000_0000 再 `>> 1` 。 **應該等於 0 吧。** 然而實測卻是這樣： ```cpp $ cat main.c #include <stdio.h> #include <stdint.h> uint16_t func(uint16_t N) { /* change all right side bits to 1 */ N |= N >> 1; N |= N >> 2; N |= N >> 4; N |= N >> 8; return (N + 1) >> 1; } int main(){ uint16_t num = 65535; printf("%%d %d %d\n", num, func(num)); printf("%%u %d %u\n", num, func(num)); printf("%%hu %d %hu\n", num, func(num)); } $ gcc -o test main.c $ ./test %d 65535 32768 %u 65535 32768 %hu 65535 32768 ``` > 可能會覺得 printf 的型態 ( %d ) 是 int 非 uint16_t ( unsigned short )，但這並不能解釋傳輸的值為 `.. 1000 0000 0000 0000` 。 ( 已補上各個型態輸出 ) 因此閱讀了 **C11 N2310 ISO/IEC 9899:~~2017~~2x(E)** 的 **6.3 Conversions** 後有了些想法： 以下若非特別註明皆引述自 C11 N2310 ISO/IEC 9899:~~2017~~2x(E)。 * **6.3** > **1** Several operators convert operand values from one type to another automatically. This subclause specifies the result required from such an implicit conversion, as well as those that result from a cast operation (an explicit conversion). 數個 operators 會自動地對 operand values 進行型態轉換。 * **6.3.1.1 Boolean, characters, and integers - 1** > **—** The rank of a signed integer type shall be **greater than** the rank of any signed integer type **with less precision**. > **—** The rank of long long int shall be greater than the rank of long int, which shall be greater than the rank of int, **which shall be greater than the rank of short int**, which shall be greater than the rank of signed char. > **—** The rank of any **unsigned integer type shall equal the rank of the corresponding signed integer type, if any.** 對各個型態的 integer 之間的 rank 有明確規範。 1. the rank of integer > the rank of short integer 2. the rank of unsigned int == the rank of the corresponding signed integer 3. The rank of signed integer > the rank of any signed integer type with less precision * **6.3.1.3 Signed and unsigned integers** > **1** When a value with integer type is converted to another integer type other than _Bool, if the value can be represented by the new type, it is unchanged. > **2** Otherwise, if the new type is unsigned, the value is converted by repeatedly adding or subtracting one more than the maximum value that can be represented in the new type until the value is in the range of the new type$.^{61}$) 一個 integer type 再轉換到另一個 integer type 時，若可表示則不變。 * **6.3.1.8 Usual arithmetic conversions** > **1** Many operators that expect operands of arithmetic type cause conversions and yield result types in a similar way. The purpose is to determine a common real type for the operands and result. For the specified operands, each operand is converted, without change of type domain, to a type whose corresponding real type is the common real type. Unless explicitly stated otherwise, the common real type is also the corresponding real type of the result, whose type domain is the type domain of the operands if they are the same, and complex otherwise. This pattern is called the **usual arithmetic conversions** : > ... > Otherwise, if the operand that has unsigned integer type has rank greater or equal to the rank of the type of the other operand, then the operand with signed integer type is converted to the type of the operand with unsigned integer type. > Otherwise, **if the type of the operand with signed integer type can represent all of the values of the type of the operand with unsigned integer type, then the operand with unsigned integer type is converted to the type of the operand with signed integer type.** 1. 如果 unsigned integer 的 rank 大於等於另一個整數型態，另個會被轉換成 unsigned integer 。 2. 如果 signed integer type 可以表示所有 unsigned integer type 的值則 unsigned integer type 會被轉換成 signed integer type * **6.4.4.1 Integer constants - Semantics** > **5** The type of an integer constant is the first of the corresponding list in which its value can be represented. ![](https://imgur.com/Gcwr4SN.png) 可以知道 1 會是 int 型態，亦即 int32_t 。 * 最後可以總結出： ```cpp return (N + 1) >> 1; ``` 型態為 `uint16_t` 的變數 `N` 會因為 1 轉成有號整數 (`int32_t`) 型態，再做 right shift 。 > 0000 0000 0000 0001 _ 0000 0000 0000 0000 * 而關於 shift，規格書也有規範到 (**6.5.7 Bitwise shift operators**) ： > **5** The result of $E1$ >> $E2$ is $E1$ right-shifted $E2$ bit positions. **If $E1$ has an unsigned type or if $E1$ has asigned type and a nonnegative value, the value of the result is the integral part of the quotient of $E1/2^{E2}$.** If E1 has a signed type and a negative value, the resulting value is implementation-defined. > 0000 0000 0000 0000 _ 1000 0000 0000 0000 => 32768 * 最後再以 uint16_t 的 32768 (1000 0000 0000 0000) 傳出函式。 ## 2 is_power_of_2 ### 說明 is_power_of_2 原理 >* [kernel api - is_power_of_2 ](https://www.kernel.org/doc/html/latest/core-api/kernel-api.html) >bool is_power_of_2(unsigned long n) >>Determine whether some value is a power of two, where zero is not considered a power of two. - [ ] include/linux/log2.h ```cpp /** * is_power_of_2() - check if a value is a power of two * @n: the value to check * * Determine whether some value is a power of two, where zero is * *not* considered a power of two. * Return: true if @n is a power of 2, otherwise false. */ static inline __attribute__((const)) bool is_power_of_2(unsigned long n) { return (n != 0 && ((n & (n - 1)) == 0)); } ``` > 確認 n 不為零，且在與自己減一過後的值 AND 後為零的話回傳 true 。 >(1000 & 0111) == 0 => true >(1111 & 1110) == 0 => false ```cpp /** * __roundup_pow_of_two() - round up to nearest power of two * @n: value to round up */ static inline __attribute__((const)) unsigned long __roundup_pow_of_two(unsigned long n) { return 1UL << fls_long(n - 1); } ``` ```cpp /** * __rounddown_pow_of_two() - round down to nearest power of two * @n: value to round down */ static inline __attribute__((const)) unsigned long __rounddown_pow_of_two(unsigned long n) { return 1UL << (fls_long(n) - 1); } ``` 1UL 為 unsigned long 型的 1 。 ```cpp static inline unsigned fls_long(unsigned long l) { if (sizeof(l) == 4) return fls(l); return fls64(l); } ``` 會區分 32 與 64 位元。而繼續追 `fls64` 會查到是在 `arch` 目錄下： > arch/alpha/include/asm/bitops.h, line 369 (as a function) arch/alpha/include/asm/bitops.h, line 376 (as a function) arch/powerpc/include/asm/bitops.h, line 235 (as a function) arch/s390/include/asm/bitops.h, line 395 (as a function) arch/x86/include/asm/bitops.h, line 366 (as a function) 可以看到會因各個架構而有不同定義，在 x86 底下： [Linux kernel source tree](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/include/asm/bitops.h?h=v5.12-rc2#n324) 。如果查看 alpha 會是： ```cpp=365 /* * fls: find last bit set. */ #if defined(CONFIG_ALPHA_EV6) && defined(CONFIG_ALPHA_EV67) static inline int fls64(unsigned long word) { return 64 - __kernel_ctlz(word); } #else extern const unsigned char __flsm1_tab[256]; static inline int fls64(unsigned long x) { unsigned long t, a, r; t = __kernel_cmpbge (x, 0x0101010101010101UL); a = __flsm1_tab[t]; t = __kernel_extbl (x, a); r = a*8 + __flsm1_tab[t] + (x != 0); return r; } #endif static inline unsigned long __fls(unsigned long x) { return fls64(x) - 1; } static inline int fls(unsigned int x) { return fls64(x); } ``` ```cpp # define __kernel_ctlz(x) __builtin_clzl(x) ``` > Built-in Function: [int __builtin_clzl (unsigned long)](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html) **Similar to `__builtin_clz`, except the argument type is unsigned long.** 而 `__builtin_clz` 則是計算從 most significant bit 開始有幾個 0 ，亦即在到第一個 1 之前有幾個 0 。 * 舉例在 alpha 下， n = 139,611,588,448,485,376 ( 0000 0001 1111 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 ) * **`roundup_pow_of_two`** `fls_long` (fls64) 會是 64 - 7 = 57 。 1UL << 57 = 144,115,188,075,855,872 ( 0000 0010 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 ) * **`rounddown_pow_of_two`** `fls_long` (fls64) 會是 64 - 7 = 57 。 1UL << (57 - 1) = 72,057,594,037,927,936 ( 0000 0001 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 ) ### 考量 * 在記憶體分配的 [buddy memory allcation](https://en.wikipedia.org/wiki/Buddy_memory_allocation) 原理： > There are various forms of the buddy system; those in which each block is subdivided into two smaller blocks are the simplest and most common variety. **Every memory block in this system has an order, where the order is an integer ranging from 0 to a specified upper limit. The size of a block of order n is proportional to $2^n$, so that the blocks are exactly twice the size of blocks that are one order lower.** Power-of-two block sizes make address computation simple, because all buddies are aligned on memory address boundaries that are powers of two. When a larger block is split, it is divided into two smaller blocks, and each smaller block becomes a unique buddy to the other. A split block can only be merged with its unique buddy block, which then reforms the larger block they were split from. 可以知道記憶體會被切分成大小為 $2^n$ 的 block ，且是以其作單位去分配給行程等使用。 [wikipedia](https://en.wikipedia.org/wiki/Buddy_memory_allocation) 給了一個很好理解的例子請去查看 [budd memory allocation - example](https://en.wikipedia.org/wiki/Buddy_memory_allocation) 。如果查 [國家教育研究院的雙語詞彙、學術名詞暨辭書資訊網](https://terms.naer.edu.tw/detail/1273777/) 也可以查到說明 > 伙伴系統 buddy system 2003年6月資訊與通信術語辭典名詞解釋: 資料結構的一種儲存器管理系統。系統中儲存器的模塊總是劃分成2的冪次。如果一個模塊比存儲要求大得多，就將它分成兩個規模都是2的冪次的模塊，稱此兩模塊為〝伙伴〞。如果兩個伙伴均為可用，可再重新組合成一個較大的模塊，因為它們都是2的冪次，所以容易計算出模塊的伙伴的位址，而後確定是否可以組合成大模塊。 * 但不可能每個去要記憶體的行程都剛好使用 $2^n$ 大小，因此在 Linux 核心內部有 slab allocator 去管理剩下空閒的記憶體。 * 在其中，如何知曉資料身處在哪個 block 此 `is_power_of_2` 等函式就顯得重要許多。 ## 3 slab allocator ### slab allocator 簡介 * **[OSTEP : Free-Space Management](https://pages.cs.wisc.edu/~remzi/OSTEP/vm-freespace.pdf)** > Just like any good idea, this approach introduces new complications into a system as well. For example, **how much memory should one dedicate to the pool of memory that serves specialized requests of a given size, as opposed to the general pool?** One particular allocator, the slab allocator by uber-engineer Jeff Bonwick (which was designed for use in the Solaris kernel), handles this issue in a rather nice way [B94]. > > Specifically,** when the kernel boots up, it allocates a number of object caches for kernel objects that are likely to be requested frequently (such as locks, file-system inodes, etc.)**; the object caches thus are each segregated free lists of a given size and serve memory allocation and free requests quickly. **When a given cache is running low on free space, it requests some slabs of memory from a more general memory allocator (the total amount requested being a multiple of the page size and the object in question).** * [linuxfound.org](https://events.static.linuxfound.org/sites/events/files/slides/slaballocators.pdf) > ![](https://imgur.com/s40E2jX.png) >> 完整請看 [YouTube : SL[AUO]B: Kernel memory allocator design and philosophy](https://www.youtube.com/watch?v=h0VMLXavx30) * [Linux 核心設計 - 記憶體管理](https://hackmd.io/@sysprog/linux-memory?type=view) > [YouTube : Linux 核心設計：記憶體管理 (二) (2019-04-14)](https://www.youtube.com/watch?v=0oABzHrZDPY) >> 註：此標示僅為簡介，若要完整知道請全部看完三部共 8 小時 46 分鐘左右。 ### is_power_of_2 在 Linux kernel 的使用 - [ ] `mm/slab_common.c` ```cpp=541 /* Create a cache during boot when no slab services are available yet */ void __init create_boot_cache(struct kmem_cache *s, const char *name, unsigned int size, slab_flags_t flags, unsigned int useroffset, unsigned int usersize) { int err; unsigned int align = ARCH_KMALLOC_MINALIGN; s->name = name; s->size = s->object_size = size; /* * For power of two sizes, guarantee natural alignment for kmalloc * caches, regardless of SL*B debugging options. */ if (is_power_of_2(size)) align = max(align, size); s->align = calculate_alignment(flags, align, size); s->useroffset = useroffset; s->usersize = usersize; err = __kmem_cache_create(s, flags); if (err) panic("Creation of kmalloc slab %s size=%u failed. Reason %d\n", name, size, err); s->refcount = -1; /* Exempt from merging for now */ } ``` - [ ] `mm/slob.c` ```cpp=464 /* * End of slob allocator proper. Begin kmem_cache_alloc and kmalloc frontend. */ static __always_inline void * __do_kmalloc_node(size_t size, gfp_t gfp, int node, unsigned long caller) { unsigned int *m; int minalign = max_t(size_t, ARCH_KMALLOC_MINALIGN, ARCH_SLAB_MINALIGN); void *ret; gfp &= gfp_allowed_mask; might_alloc(gfp); if (size < PAGE_SIZE - minalign) { int align = minalign; /* * For power of two sizes, guarantee natural alignment for * kmalloc()'d objects. */ if (is_power_of_2(size)) align = max(minalign, (int) size); if (!size) return ZERO_SIZE_PTR; m = slob_alloc(size + minalign, gfp, align, node, minalign); if (!m) return NULL; *m = size; ret = (void *)m + minalign; trace_kmalloc_node(caller, ret, size, size + minalign, gfp, node); } else { unsigned int order = get_order(size); if (likely(order)) gfp |= __GFP_COMP; ret = slob_new_pages(gfp, order, node); trace_kmalloc_node(caller, ret, size, PAGE_SIZE << order, gfp, node); } kmemleak_alloc(ret, size, 1, gfp); return ret; } ``` 在 `mm/slab_common.c` 第 556 行和 `mm/slob.c` 第 483 行說道，是為了確保在 kmalloc caches 是對齊。 * `create_boot_cache` 在 `mm/slab.c` , `mm/slab_common.c` 和 `mm/slub.c` 被呼叫。 * `__do_kmalloc_node` 則在 `mm/slab.c` , `mm/slob.c` 被呼叫。而關於 `.. guarantee natural alignment for kmalloc ..` 這行註解，可以從 lwn 的一篇文章看起 [Alignment guarantees for kmalloc()](https://lwn.net/Articles/787740/) ： > In particular, Babka wanted to discuss when kmalloc() should return objects with their **"natural alignment"**. What that term means was not actually defined until near the end of the session; presumably nobody will object to a slight bit of reordering at this point. **Natural alignment for an object means that its beginning address is a multiple of each of up to three different quantities:** **ARCH_KMALLOC_MINALIGN** (eight bytes, by default), **the cache-line size** (for larger objects), and **the size of the object itself when that size is a power of two**. **The actual required alignment will generally be the least common multiple of the three.** 可以看到對於 natural alignment 的說明，然而此篇文章主要是在探討 kmalloc 的意外狀況： > **Most of the time, Babka said, kmalloc() already returns naturally aligned objects for a power-of-two allocation size; that result falls out of how the slab pages are laid out.** But there are exceptions: when **SLUB debugging** is enabled or when the **SLOB allocator** is used. Few people worry much about SLOB, but the SLUB debugging exception can lead to problems for unsuspecting developers. 有了對 natural alignment 定義後，可以再看 natural alignement 的大略描述，引自 [Red Hat Developer blog](https://developers.redhat.com/blog/2016/06/01/how-to-avoid-wasting-megabytes-of-memory-a-few-bytes-at-a-time/) ： > Natural alignment describes the practice in which the address of the data type is a multiple of the size of the data type. Using natural alignment allows the processor to avoid doing multiple memory operations to access a single value. This natural alignment has a cost, and it can lead to larger data structures. > >To improve the speed of memory accesses and processor performance, each memory access on modern processors deals with a larger group of bits than a single eight bit byte, typically, 64 bits (eight bytes) or 128 bits (16 bytes). > > For example, **a double precision floating point value is made up of eight bytes (b0 through b7)** and we assume the memory width is eight bytes. If the double precision floating point number **is naturally aligned**, **the entire value starting at address+0 can be read with a single memory access** (as can be seen in the diagram below.) ![](https://imgur.com/y1oENwS.png) > However, if the floating point number **is not aligned (unaligned), it straddles two words in memory and two read must be performed (one read for b0 through b3 and another read for b4 through b7)**. Additionally, the processor needs to paste the bits from the two reads into a single 64-bit value which also takes time. 可以看到是為了對 CPU 讀取數值時的次數等最佳化。此篇內容也有提到如果進行了 Alignment 可能會因為原先要儲存的數值小於對齊的需求，因而有 padding 來填充使其符合大小。可以利用 gcc 的 -Wpadded 來檢查是否有 padding 。文中也給出了很好的例子來告訴我們再知曉其操作後如何 re-order 這些資料： ```cpp struct edge { int from; \\ 4 bytes + ( padding 4 bytes ) double distance; \\ 8 bytes int to; \\ 4 bytes + ( padding 4 bytes ) } edge; ``` ```cpp struct edge { int from; \\ 4 bytes + 4 bytes int to; double distance; \\ 8 bytes }; ``` > 在 x86_64 下。詳細請點[此連結](https://developers.redhat.com/blog/2016/06/01/how-to-avoid-wasting-megabytes-of-memory-a-few-bytes-at-a-time/)。 :::info Reference * [lwn.net - Alignment guarantees for kmalloc()](https://lwn.net/Articles/787740/) * [redhat.com - How to avoid wasting megabytes of memory a few bytes at a time](https://developers.redhat.com/blog/2016/06/01/how-to-avoid-wasting-megabytes-of-memory-a-few-bytes-at-a-time/) 額外閱讀 * [I/O Performance HOWTO - Avoiding Bounce Buffers](https://tldp.org/HOWTO/IO-Perf-HOWTO/overview.html) ::: - [ ] [`mm/percpu.c`](https://elixir.bootlin.com/linux/latest/source/mm/percpu.c#L1713) ```cpp=1661 /** * pcpu_alloc - the percpu allocator * @size: size of area to allocate in bytes * @align: alignment of area (max PAGE_SIZE) * @reserved: allocate from the reserved chunk if available * @gfp: allocation flags * * Allocate percpu area of @size bytes aligned at @align. If @gfp doesn't * contain %GFP_KERNEL, the allocation is atomic. If @gfp has __GFP_NOWARN * then no warning will be triggered on invalid or failed allocation * requests. * * RETURNS: * Percpu pointer to the allocated area on success, NULL on failure. */ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved, gfp_t gfp) { gfp_t pcpu_gfp; bool is_atomic; bool do_warn; enum pcpu_chunk_type type; struct list_head *pcpu_slot; struct obj_cgroup *objcg = NULL; static int warn_limit = 10; struct pcpu_chunk *chunk, *next; const char *err; int slot, off, cpu, ret; unsigned long flags; void __percpu *ptr; size_t bits, bit_align; gfp = current_gfp_context(gfp); /* whitelisted flags that can be passed to the backing allocators */ pcpu_gfp = gfp & (GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN); is_atomic = (gfp & GFP_KERNEL) != GFP_KERNEL; do_warn = !(gfp & __GFP_NOWARN); /* * There is now a minimum allocation size of PCPU_MIN_ALLOC_SIZE, * therefore alignment must be a minimum of that many bytes. * An allocation may have internal fragmentation from rounding up * of up to PCPU_MIN_ALLOC_SIZE - 1 bytes. */ if (unlikely(align < PCPU_MIN_ALLOC_SIZE)) align = PCPU_MIN_ALLOC_SIZE; size = ALIGN(size, PCPU_MIN_ALLOC_SIZE); bits = size >> PCPU_MIN_ALLOC_SHIFT; bit_align = align >> PCPU_MIN_ALLOC_SHIFT; if (unlikely(!size || size > PCPU_MIN_UNIT_SIZE || align > PAGE_SIZE || !is_power_of_2(align))) { WARN(do_warn, "illegal size (%zu) or align (%zu) for percpu allocation\n", size, align); return NULL; } ``` 在第 1713 行，檢測要配置給 CPU 的記憶體大小是否對齊。 :::info Reference * [C99 6.5.7 Bitwise shift operators](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf) * [WG14/N1256 - ISO/IEC 9899:TC3 - Footnotes 44, 45](http://port70.net/~nsz/c/c99/n1256.html#note44) * C11 N2310 ISO/IEC 9899:20172x(E) * [linuxfound - slab/slob/slub](https://events.static.linuxfound.org/sites/events/files/slides/slaballocators.pdf) * [YouTube](https://www.youtube.com/watch?v=h0VMLXavx30) * [Slab-allocators](https://hackmd.io/@VIRqdo35SvekIiH4p76B7g/rJ-UJ_uWx?type=view#Slab-allocators) * [Linux 核心設計 - 記憶體管理](https://hackmd.io/@sysprog/linux-memory?type=view) * [OSTEP 17.4](https://pages.cs.wisc.edu/~remzi/OSTEP/vm-freespace.pdf) ::: --- ## 測驗 `3` ## 1 解釋程式運作原理與重寫 ### 解釋 ```cpp void bitcpy(void *_dest, /* Address of the buffer to write to */ size_t _write, /* Bit offset to start writing to */ const void *_src, /* Address of the buffer to read from */ size_t _read, /* Bit offset to start reading from */ size_t count) size_t read_lhs = _read & 7; size_t read_rhs = 8 - read_lhs; const uint8_t *source = (const uint8_t *) _src + (_read / 8); size_t write_lhs = _write & 7; size_t write_rhs = 8 - write_lhs; uint8_t *dest = (uint8_t *) _dest + (_write / 8); ``` * `_write` 與 `_read` 是為了確保 bit 在做完複製後期結果為符合我們預期 (即想要的偏移量)。可以看到在第 4 到第 8 行對讀取資料的位置進行了修正： ```cpp= while (count > 0) { uint8_t data = *source++; size_t bitsize = (count > 8) ? 8 : count; if (read_lhs > 0) { data <<= read_lhs; if (bitsize > read_rhs) data |= (*source >> read_rhs); } if (bitsize < 8) data &= read_mask[bitsize]; uint8_t original = *dest; uint8_t mask = read_mask[write_lhs]; if (bitsize > write_rhs) { /* Cross multiple bytes */ *dest++ = (original & mask) | (data >> write_lhs); original = *dest & write_mask[bitsize - write_rhs]; *dest = original | (data << write_rhs); } else { // Since write_lhs + bitsize is never >= 8, no out-of-bound access. mask |= write_mask[write_lhs + bitsize]; *dest++ = (original & mask) | (data >> write_lhs); } count -= bitsize; } ``` * 在第 13 到第 24 行處理寫入的操作，以 `bitsize > write_rhs` 分成了跨位元組與否的複製。 * 寫入前對寫入目的地與 mask 的前置動作在第 13 及第 14 行。 * `uint8_t original = *dest;` 儲存原先資訊 * `uint8_t mask = read_mask[write_lhs];` 設置 mask 為寫入的資料偏移位置做處理 * 多位元組複製 - 第一個位元組 * `original & mask` (清除要放入位置的資料) * `data >> write_lhs` (調整 data 到我們要放入的位置) * `*dest++ = (original & mask) | (data >> write_lhs);` 寫入 * 多位元組複製 - 第二個位元組 * `original = *dest & write_mask[bitsize - write_rhs];` 重新利用 `original` 為下一個位元組要放入的資料的 mask 。 * `*dest = original | (data << write_rhs);` 寫入 > example: > write in _ _ _ _ _ _ 1 1 , 1 1 _ _ _ _ _ _ > write_rhs = 2 > write_lhs = 6 ### 改寫 ```cpp #define read_mask(t) ((uint8_t)(0xFF << (8 - t))) #define write_mask(t) (~read_mask((t))) //if (bitsize < 8) data &= read_mask(bitsize); uint8_t mask = read_mask(write_lhs); if (bitsize > write_rhs) { *dest = (*dest & mask) | (data >> write_lhs); dest++; *dest = *dest & write_mask(bitsize - write_rhs); *dest = *dest | (data << write_rhs); } else { // Since write_lhs + bitsize is never >= 8, no out-of-bound access. mask |= write_mask(write_lhs + bitsize); *dest++ = (*dest & mask) | (data >> write_lhs); } count -= bitsize; ``` ## 2 Linux 核心內的 bit 複製 - [ ] [lib/bitmap.c](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/lib/bitmap.c?h=v5.12-rc2#n292) ```cpp=292 void __bitmap_replace(unsigned long *dst, const unsigned long *old, const unsigned long *new, const unsigned long *mask, unsigned int nbits) { unsigned int k; unsigned int nr = BITS_TO_LONGS(nbits); for (k = 0; k < nr; k++) dst[k] = (old[k] & ~mask[k]) | (new[k] & mask[k]); } EXPORT_SYMBOL(__bitmap_replace); ``` - [ ] [drivers/video/fbdev/amifb.c](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/video/fbdev/amifb.c?h=v5.12-rc2#n2602) ```cpp=2597 /* * Unaligned forward bit copy using 32-bit or 64-bit memory accesses */ static void bitcpy(unsigned long *dst, int dst_idx, const unsigned long *src, int src_idx, u32 n) ... ``` ### 圖形處理的 Framebuffer 引述自 [Stéphane Marchesin - Linux Graphics Drivers: an Introduction](https://people.freedesktop.org/~marcheu/linuxgraphicsdrivers.pdf) ： > ![](https://imgur.com/6guYxcV.png) > **4.2 Framebuffer operations** > The framebuffer operations structure is how non-modesetting framebuffer callbacksare set. Different callbacks can be set depending on what functionality you wish to implement, like fills, copies, or cursor handling. By filling struct fb_ops callbacks, one can implement the following functions: > > Copy data from area to another > ```cpp > void fb_copyarea(struct fb_info ∗info, const struct fb_copyarea ∗region) > ``` 而 `fb_copyarea` 是被包在 `fb_ops` 結構中，在 [drivers/video/fbdev/amifb.c](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/video/fbdev/amifb.c?h=v5.12-rc2#n3500) 初始為以下： ```cpp=3500 static const struct fb_ops amifb_ops = { .owner = THIS_MODULE, .fb_check_var = amifb_check_var, .fb_set_par = amifb_set_par, .fb_setcolreg = amifb_setcolreg, .fb_blank = amifb_blank, .fb_pan_display = amifb_pan_display, .fb_fillrect = amifb_fillrect, .fb_copyarea = amifb_copyarea, .fb_imageblit = amifb_imageblit, .fb_ioctl = amifb_ioctl, }; ``` `amifb_copyarea` 則一樣在 [drivers/video/fbdev/amifb.c](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/video/fbdev/amifb.c?h=v5.12-rc2#n3246) 中： ```cpp=3246 static void amifb_copyarea(struct fb_info *info, const struct fb_copyarea *area) ... if (rev_copy) { while (height--) { dst_idx -= par->next_line * 8; src_idx -= par->next_line * 8; copy_one_line_rev(info->var.bits_per_pixel, par->next_plane, dst, dst_idx, src, src_idx, width); } } else { while (height--) { copy_one_line(info->var.bits_per_pixel, par->next_plane, dst, dst_idx, src, src_idx, width); dst_idx += par->next_line * 8; src_idx += par->next_line * 8; } } } ``` 其中 [`copy_one_line`](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/video/fbdev/amifb.c?h=v5.12-rc2#n3211) 使用到 `bitcpy` ： ```cpp=3211 static inline void copy_one_line(int bpp, unsigned long next_plane, unsigned long *dst, int dst_idx, unsigned long *src, int src_idx, u32 n) { while (1) { dst += dst_idx >> SHIFT_PER_LONG; dst_idx &= (BITS_PER_LONG - 1); src += src_idx >> SHIFT_PER_LONG; src_idx &= (BITS_PER_LONG - 1); bitcpy(dst, dst_idx, src, src_idx, n); if (!--bpp) break; dst_idx += next_plane * 8; src_idx += next_plane * 8; } } ``` ### user and kernel space 是以 byte 單位。引自 [Oracle® Linux 6Porting Guide](https://docs.oracle.com/en/operating-systems/oracle-linux/6/porting/section_ohm_jhk_tm.html) > You can use the `copy_from_user()` and `copy_to_user()` functions to move data between kernel space and user space. Alternatively, **when moving one, two, four, or eight bytes of data**, you can use either `put_user()` and `get_user()` or `access_ok()` to validate the user-space address followed by either `__put_user()` or` __get_user()`. > > If user programs require direct access to device memory, you can use the mmap() system call, which maps device memory directly into user space. For example, the X server uses mmap() to write to video adapter memory and PCI devices usually memory map their control registers to improve performance. A limitation is that the area being mapped has to be a multiple of PAGE_SIZE and start at a physical memory address that is also a multiple of PAGE_SIZE. ### 額外查閱資料時看到此函式，覺得有趣也上放來。在 [linux-kernel-labs.github.io](https://linux-kernel-labs.github.io/refs/heads/master/labs/memory_mapping.html) 的 memory mapping 有看到 > Enable the PG_reserved bit of each page with SetPageReserved(). Clear the bit with ClearPageReserved() before freeing the memory. 其中 `SetPageReserved()` 為 `#define SetPageReserved(page) set_bit(PG_reserved, &(page)->flags)` 。而此作用在 [linux-kernel-labs.github.io - memory mapping](https://linux-kernel-labs.github.io/refs/heads/master/labs/memory_mapping.html) 是 > Since the pages are mapped to user space, they might be swapped out. To avoid this we must set the PG_reserved bit on the page. Enabling is done using SetPageReserved() while reseting it (which must be done before freeing the memory) is done with ClearPageReserved(): 因此去查了原始碼竟然是用 **inline assembly** 實作，是為了確保是 **atomic** 層級的操作 - [ ] [`arch/alpha/include/asm/bitops.h`](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/alpha/include/asm/bitops.h?h=v5.12-rc2#n28) ```cpp /* * These have to be done with inline assembly: that way the bit-setting * is guaranteed to be atomic. All bit operations return 0 if the bit * was cleared before the operation and != 0 if it was not. * * To get proper branch prediction for the main line, we must branch * forward to code at the end of this object's .text section, then * branch back to restart the operation. * * bit 0 is the LSB of addr; bit 64 is the LSB of (addr+1). */ static inline void set_bit(unsigned long nr, volatile void * addr) { unsigned long temp; int *m = ((int *) addr) + (nr >> 5); __asm__ __volatile__( "1: ldl_l %0,%3\n" " bis %0,%2,%0\n" " stl_c %0,%1\n" " beq %0,2f\n" ".subsection 2\n" "2: br 1b\n" ".previous" :"=&r" (temp), "=m" (*m) :"Ir" (1UL << (nr & 31)), "m" (*m)); } ``` :::info Reference * [wikipedia - Bit array](https://en.wikipedia.org/wiki/Bit_array) * Linux kernel source tree * [root/lib/bitmap.c](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/lib/bitmap.c?h=v5.12-rc2) * [root/drivers/video/fbdev/amifb.c](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/video/fbdev/amifb.c?h=v5.12-rc2) * [Stéphane Marchesin - Linux Graphics Drivers: an Introduction](https://people.freedesktop.org/~marcheu/linuxgraphicsdrivers.pdf) * [Oracle® Linux 6Porting Guide](https://docs.oracle.com/en/operating-systems/oracle-linux/6/porting/section_ohm_jhk_tm.html) * [developer.ibm.com - User space memory access from the Linux kernel](https://developer.ibm.com/technologies/linux/articles/l-kernel-memory-access/) * [linux-kernel-labs.github.io - memory mapping](https://linux-kernel-labs.github.io/refs/heads/master/labs/memory_mapping.html) * [blog.csdn.net - Linux驱动开发杂记(0x0C) - SetPageReserved()](https://blog.csdn.net/u011471873/article/details/84105133) ::: --- ## 測驗 `4` ## 1 解釋程式運作原理 ### 結構定義 // thing, will put int to the box ```cpp typedef struct __cstr_data { char *cstr; uint32_t hash_size; uint16_t type; uint16_t ref; } * cstring; typedef struct __cstr_buffer { cstring str; } cstr_buffer[1]; ``` // control all the thing // __cstr_node is box // pool is the place store those boxes // hash => index_0 index_1, index -> linked list // pool => all the node ```cpp struct __cstr_node { char buffer[CSTR_INTERNING_SIZE]; struct __cstr_data str; struct __cstr_node *next; }; struct __cstr_pool { struct __cstr_node node[INTERNING_POOL_SIZE]; }; struct __cstr_interning { int lock; int index; unsigned size; unsigned total; struct __cstr_node **hash; struct __cstr_pool *pool; }; ``` // memory level ```cpp enum { CSTR_PERMANENT = 1, CSTR_INTERNING = 2, CSTR_ONSTACK = 4, }; #define CSTR_INTERNING_SIZE (32) #define CSTR_STACK_SIZE (128) ``` 因為函式個別解釋會太過瑣碎，因此我以 `test_cstr` 為起始講解： ```cpp static void test_cstr() { CSTR_BUFFER(a); cstr_cat(a, "Hello "); cstr_cat(a, "string"); cstring b = cmp(CSTR_S(a)); printf("%s\n", b->cstr); CSTR_CLOSE(a); cstr_release(b); } ``` * `CSTR_BUFFER` * 以 macro 宣告一個以 `cstr_buffer` 為主的物件。 * 其中包含了 `CSTR_STACK_SIZE` 大小為 128 Bytes 儲存在 stack 的小字串。 * `__cstr_data` 結構 ( aka cstring )，用來儲存資訊，是 `cstr_buffer` 唯一指向的物件。 ```cpp #define CSTR_BUFFER(var) \ char var##_cstring[CSTR_STACK_SIZE] = {0}; \ struct __cstr_data var##_cstr_data = {var##_cstring, 0, CSTR_ONSTACK, 0}; \ cstr_buffer var; \ var->str = &var##_cstr_data; ``` * `cstr_cat` * 逐字元增加字串，若 `type` 非 `CSTR_ONSTACK` 或是但超出 `CSTR_STACK_SIZE` 則進入 `cstr_cat2` 。 * `cstr_cat2` 則會開始處理到中字串與大字串的範疇。 * `if (*str == 0)` 為判斷加上的字串是否在 `CSTR_STACK_SIZE` 層級內加完。 * `s->cstr[i] = 0;` 則是為了確保不要在進入 `cstr_cat2` 使接下來讀取字串產生問題。 ```cpp cstring cstr_cat(cstr_buffer sb, const char *str) { cstring s = sb->str; if (s->type == CSTR_ONSTACK) { int i = s->hash_size; while (i < CSTR_STACK_SIZE - 1) { s->cstr[i] = *str; if (*str == 0) return s; ++s->hash_size; ++str; ++i; } s->cstr[i] = 0; } cstring tmp = s; sb->str = cstr_cat2(tmp->cstr, str); cstr_release(tmp); return sb->str; } ``` * `cstr_cat2` * `if (sa + sb < CSTR_INTERNING_SIZE)` 是在判斷是否是在 CSTR_INTERNING 層級。 * 若是，則會在處理完字串後，進入 `cstr_interning` 內。 * 當不是時則進行 xalloc ，把 type 設 0 、ref 設 1 、hash_size 設 0 回傳到原先的 `sb->str` 。 ```cpp static cstring cstr_cat2(const char *a, const char *b) { size_t sa = strlen(a), sb = strlen(b); if (sa + sb < CSTR_INTERNING_SIZE) { char tmp[CSTR_INTERNING_SIZE]; memcpy(tmp, a, sa); memcpy(tmp + sa, b, sb); tmp[sa + sb] = 0; return cstr_interning(tmp, sa + sb, hash_blob(tmp, sa + sb)); } cstring p = xalloc(sizeof(struct __cstr_data) + sa + sb + 1); if (!p) return NULL; char *ptr = (char *) (p + 1); p->cstr = ptr; p->type = 0; p->ref = 1; memcpy(ptr, a, sa); memcpy(ptr + sa, b, sb); ptr[sa + sb] = 0; p->hash_size = 0; return p; } ``` * `hash_blob` * 此函式為計算字串在 hash table 中的值。 ```cpp static inline uint32_t hash_blob(const char *buffer, size_t len) { const uint8_t *ptr = (const uint8_t *) buffer; size_t h = len; size_t step = (len >> 5) + 1; for (size_t i = len; i >= step; i -= step) h = h ^ ((h << 5) + (h >> 2) + ptr[i - 1]); return h == 0 ? 1 : h; } ``` * cstr_interning * 找出在 `__cstr_ctx` 裡儲存的 `hash` 中符合 `cstr` 的字串。 * 現存總數 total 會加一。 ```cpp static cstring cstr_interning(const char *cstr, size_t sz, uint32_t hash) { cstring ret; CSTR_LOCK(); ret = interning(&__cstr_ctx, cstr, sz, hash); if (!ret) { expand(&__cstr_ctx); ret = interning(&__cstr_ctx, cstr, sz, hash); } ++__cstr_ctx.total; CSTR_UNLOCK(); return ret; } ``` * interning * 在 `struct __cstr_interning *si` 中的 `hash` 尋找符合的字串。 * 以剛剛計算在 `hash_blob` 的 `uint32_t hash` 值除以 `si->size - 1` 的餘數為 `index` ，`hash[index]` 中開始尋找。 * 之所以能 `hash & (si->size - 1)` ，是因為有確保每次更新 `si->size` 都為二的倍數。 * 如果都沒找到則會新增，但當 hash 所擁有的紀錄達到其臨界值 (4/5 == total / size ) 會停止操作。 * 若否，則在 `pool` 新增新的 `node` 並在 `hash[index]` 標記。 > 疑慮，是否有確保 `n = &si->pool->node[si->index++];` 在大於 `INTERNING_POOL_SIZE` 時能正常運作。 * 不論尋找到或是直接新增，回傳值均為 `cstring` 型態。 ```cpp static cstring interning(struct __cstr_interning *si, const char *cstr, size_t sz, uint32_t hash) { if (!si->hash) return NULL; int index = (int) (hash & (si->size - 1)); struct __cstr_node *n = si->hash[index]; while (n) { if (n->str.hash_size == hash) { if (!strcmp(n->str.cstr, cstr)) return &n->str; } n = n->next; } // 80% (4/5) threshold if (si->total * 5 >= si->size * 4) return NULL; if (!si->pool) { si->pool = xalloc(sizeof(struct __cstr_pool)); si->index = 0; } n = &si->pool->node[si->index++]; memcpy(n->buffer, cstr, sz); n->buffer[sz] = 0; cstring cs = &n->str; cs->cstr = n->buffer; cs->hash_size = hash; cs->type = CSTR_INTERNING; cs->ref = 0; n->next = si->hash[index]; si->hash[index] = n; return cs; } ``` * `expand` * 當 `hash` 的大小不夠或初次呼叫時，會呼叫此函式。 * 初次呼叫 `hash` 大小會直接是 `HASH_START_SIZE 16`。 * `unsigned new_size = si->size * 2;` ，此操作能確保 `size` 為 2 的次方。 * 會以 `insert_node` 一一複製原先的 hash所擁有的資料。 ```cpp static void expand(struct __cstr_interning *si) { unsigned new_size = si->size * 2; if (new_size < HASH_START_SIZE) new_size = HASH_START_SIZE; struct __cstr_node **new_hash = xalloc(sizeof(struct __cstr_node *) * new_size); memset(new_hash, 0, sizeof(struct __cstr_node *) * new_size); for (unsigned i = 0; i < si->size; ++i) { struct __cstr_node *node = si->hash[i]; while (node) { struct __cstr_node *tmp = node->next; insert_node(new_hash, new_size, node); node = tmp; } } free(si->hash); si->hash = new_hash; si->size = new_size; } ``` * cstr_release * 釋放的條件是 type 不為零、 ref 為 0 。 * > [5.44 Built-in functions for atomic memory access - `__sync_sub_and_fetch` ](https://gcc.gnu.org/onlinedocs/gcc-4.1.1/gcc/Atomic-Builtins.html) ```cpp void cstr_release(cstring s) { if (s->type || !s->ref) return; if (__sync_sub_and_fetch(&s->ref, 1) == 0) free(s); } ``` **至此，`cstr_cat` 的操作才算完成。** //TODO release close 接下來的函式是在針對單一字串進行讀取使所用的函式與巨集。 * cstr_grab * 可以看到在對不同的 `type` 有不同的對應。 * 當是 `CSTR_PERMANENT` 或 `CSTR_INTERNING` 會直接回傳。 * 但如果是 `CSTR_ONSTACK` 會進行 `cstr_clone` 。 * `cstr_clone` 則是在 s->hash_size 小於 CSTR_INTERNING_SIZE時，把 `cstring s` 記入到 `hash` 裡。 * 如超出大小則複製，並回傳副本。 ```cpp cstring cstr_grab(cstring s) { if (s->type & (CSTR_PERMANENT | CSTR_INTERNING)) return s; if (s->type == CSTR_ONSTACK) return cstr_clone(s->cstr, s->hash_size); if (s->ref == 0) s->type = CSTR_PERMANENT; else __sync_add_and_fetch(&s->ref, 1); return s; } ``` ```cpp cstring cstr_clone(const char *cstr, size_t sz) { if (sz < CSTR_INTERNING_SIZE) return cstr_interning(cstr, sz, hash_blob(cstr, sz)); cstring p = xalloc(sizeof(struct __cstr_data) + sz + 1); if (!p) return NULL; void *ptr = (void *) (p + 1); p->cstr = ptr; p->type = 0; p->ref = 1; memcpy(ptr, cstr, sz); ((char *) ptr)[sz] = 0; p->hash_size = 0; return p; } ``` ### uthash 關於 hash ，之前在看論壇時有人有推薦可以用 [uthash](https://troydhanson.github.io/uthash/userguide.html) 這個東西。此並非是 C 的函式庫，只是個標頭檔因此使用起來很方便。 > **Add, find and delete are normally constant-time operations.** This is influenced by your key domain and the hash function. > [原始碼](https://github.com/troydhanson/uthash/blob/master/src/uthash.h) 蠻妙的是它是系列相關的專案，包含了 [utstring](https://troydhanson.github.io/uthash/utstring.html) 。 ## 2 string interning in 多執行緒 ### `atomic` and `__sync` * `atomic` and `__sync` * `__sync` 的實作在某些硬體上根本沒有，官方也推薦新的程式碼該用別的 * [gcc - 6.54 Legacy __sync Built-in Functions for Atomic Memory Access ](https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html#g_t_005f_005fsync-Builtins) > These functions are implemented in terms of the ‘__atomic’ builtins (see __atomic Builtins). They should not be used for new code which should use the ‘__atomic’ builtins instead. * [gcc.gnu.org wiki atomic GCCMM](https://gcc.gnu.org/wiki/Atomic/GCCMM) ### `cstr_cat` - 字串完整性 * 第一次測試程式碼，完整程式碼請點看 [GitHub](https://github.com/linD026/linux2021_homework_quiz2_string_interning) 。 ```cpp static void test_cstr(void *buf) { cstr_buffer *a = (cstr_buffer *)buf; cstr_cat((*a), "a"); cstr_cat((*a), "a"); } static void test0(void) { #ifdef times #undef times #endif #define times 1000000 for (int i = 0; i < times; i++) { CSTR_BUFFER(a); //CSTR_BUFFER(b); pthread_t thread_0, thread_1; pthread_create(&thread_1, NULL, (void *)test_cstr, (void *)a); pthread_create(&thread_0, NULL, (void *)test_cstr, (void *)a); pthread_join(thread_0, NULL); pthread_join(thread_1, NULL); CSTR_LITERAL(hello, "aaaa"); if (!cstr_equal(hello, CSTR_S(a))) { printf("not equal : "); printf("%s\n", CSTR_S(a)->cstr); } CSTR_CLOSE(a); //CSTR_CLOSE(b); } #undef times } ``` * 會出現： ```shell= $ gcc -o thread0 cstr.c thread.c -g -lpthread $ ./thread0 not equal : a not equal : aa not equal : aa not equal : a not equal : a not equal : a not equal : aa not equal : aa not equal : aa not equal : a not equal : aa ``` * 之後做了修改，雖然是可以正常運行了，但 LOCK 的範圍太大，效率不高。 * 原先只有在 `cstr_interning` 函式亦即對 `hash` 操作時作保護，但對於字串操控還有 `cstr_cat` 這個函式。 ```cpp cstring cstr_cat(cstr_buffer sb, const char *str) { cstring s = sb->str; CSTR_LOCK(); if (s->type == CSTR_ONSTACK) { int i = s->hash_size; while (i < CSTR_STACK_SIZE - 1) { s->cstr[i] = *str; if (*str == 0) { CSTR_UNLOCK(); return s; } ++s->hash_size; ++str; ++i; } s->cstr[i] = 0; } cstring tmp = s; CSTR_UNLOCK(); sb->str = cstr_cat2(tmp->cstr, str); cstr_release(tmp); return sb->str; } ``` * 但是上述 `test_cstr` 沒辦法檢測對 `string interning` 的操作，因此我改成如下： ```cpp static void test_cstr(void *buf) { cstr_buffer *a = (cstr_buffer *)buf; cstr_cat((*a), "aaaa"); (*a)->str = cstr_grab(CSTR_S(*a)); cstr_cat((*a), "aaaa"); } ``` * 字串變成需要 16 個 a ，每次增加 4 個 a 。 * 當處理兩個不同字串時是可以的。 * 但仍就沒有完全成功。在處理同字串會失敗，因為函式間沒法保證讀取時會是預期狀況。 * 在 10000 次測試下，會有 4951 次會是失敗的，跟擲骰子差不多完全不可靠。 * 最後解決方案，因為考量到沒辦法在處理字串時每個步驟都 `lock` ，因此直接在另設個 `lock` 在呼叫時包起來。 * **但這有點治標不致本，會使用者須額外負擔分辨是否操作的視同一個字串，並且還要自己設字串的 lock 。** ```cpp volatile atomic_flag alock = ATOMIC_FLAG_INIT; #define a_lock()\ ({while (atomic_flag_test_and_set(&alock));}) #define a_unlock()\ ({atomic_flag_clear(&(alock));}) #define cstr_cat_s(s, a) do{\ a_lock();\ cstr_cat((s), a);\ a_unlock();} while (0) #define cstr_grab_s(from, to) do{\ a_lock();\ (to) = cstr_grab(from);\ a_unlock();} while (0) cstr_cat_s((*a), "aaaa"); cstr_grab_s(CSTR_S(*a), (*a)->str); cstr_cat_s((*a), "aaaa"); ``` ```shell= gcc -o thread_lock cstr_t.c thread.c -g -lpthread $ for i in {1..10}; do ./thread_lock; done $ ``` ### ref - 操作的保證 * 目標為在多執行緒下，能夠掌控 `ref` 的數值大小。 * 因此以 4 到 10 個執行緒測試了以 `cstr_grab` 和 `cstr_release` 對 `ref` 進行加減的實驗。 * 在重複執行100次時看似沒問題，但拉高到1000次就可以看到有時候會出錯如： > ref: 3, type 0 > ref: 0, type 32750 > > `type` 會是 32750 是因為保存其的 `cstring s` 已被釋放掉。 * 被釋放掉則是因為 `ref` 在增加之前已被減為零。 ```cpp static void test1(void) { #define times 1 #define thread_n 4 for (int i = 0; i < times; i++) { CSTR_BUFFER(a); cstr_cat_s(a, "CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC"); pthread_t thread[thread_n]; int j; for (j = 0; j < 2; j++) pthread_create(&thread[j], NULL, (void *)test_ref_create_ns, (void *)a); for (;j < 4;j++) pthread_create(&thread[j], NULL, (void *)test_ref_del_ns, (void *)a); for (int j = 0; j < thread_n; j++) pthread_join(thread[j], NULL); printf("ref: %d, type %d\n", a->str->ref, a->str->type); CSTR_CLOSE(a); } #undef times } static void test_ref_create_ns(void *buf) { cstr_buffer *a = (cstr_buffer *)buf; (*a)->str = cstr_grab(CSTR_S(*a)); //add (*a)->str = cstr_grab(CSTR_S(*a)); //add } static void test_ref_del_ns(void *buf) { cstr_buffer *a = (cstr_buffer *)buf; cstr_release((*a)->str); //sub } ``` * 因此就如 `cat` 一樣方式處理，但須加上 `sleep()` ： :::warning 關於老師所說 > 這是什麼可怕的解法？是可以運用 [`reader and writer`](https://en.wikipedia.org/wiki/Readers%E2%80%93writers_problem) 的方法來處理。設一個減值的 `lock` 擋住 `release` 但不擋住增值操作，之後再設一個 `semaphore` 來記錄目前增值有幾個。當那個 `semaphore` 為零代表增值操作已經完成， `lock` 才解除讓減值操作。但當時候只為了區分兩種操作，為了讓實驗方便操作所以才直接設 `sleep()` 。 ::: ```cpp static void test_ref_create(void *buf) { cstr_buffer *a = (cstr_buffer *)buf; cstr_grab_s(CSTR_S(*a), (*a)->str); cstr_grab_s(CSTR_S(*a), (*a)->str); } static void test_ref_del(void *buf) { cstr_buffer *a = (cstr_buffer *)buf; cstr_release_s((*a)->str); } ``` * 之後試了幾種組合，如果說要對 `ref` 在個執行緒保持一致性對函式 lock 即可。 * 但如果是要以要保證最後的 `ref` 值為何則須以分開增減值操作。 * [detail](https://github.com/linD026/linux2021_homework_quiz2_string_interning/blob/main/report_ref.txt) * 然而上述所的不包含 hash 存放的值，因為其以保持在 `ref` 為零的狀態且只可讀取。 #### 總結雖然有對共享值作如 `__sync_sub_and_fetch` 、 `__sync_add_and_fetch` 等保護，但在讀取數值與執行順序上卻沒有保證，關於這概念可以看 [並行程式設計: 執行順序 - Sequential Consistency](https://hackmd.io/@sysprog/concurrency/https%3A%2F%2Fhackmd.io%2F%40sysprog%2Fconcurrency-ordering#%E6%9C%80%E7%9B%B4%E8%A6%BA%E7%9A%84%E7%B4%84%E5%AE%9A-Sequential-Consistency)。簡單來說，存到 cache 要作增減的值必須要保持在最新的狀態，而當其他執行緒要去存取時也要能夠看到前者的更新資訊。 :::info Reference * [GitHub - ThreadSanitizerReportFormat](https://github.com/google/sanitizers/wiki/ThreadSanitizerReportFormat) * [sourceware.org - gdb - 4.10 Debugging Programs with Multiple Threads](https://sourceware.org/gdb/current/onlinedocs/gdb/Threads.html) * [cyberciti.biz - Linux / UNIX: Run Command a Number of Times In a Row](https://www.cyberciti.biz/faq/bsd-appleosx-linux-bash-shell-run-command-n-times/) * [gcc.gnu.org - 5.44 Built-in functions for atomic memory access](https://gcc.gnu.org/onlinedocs/gcc-4.1.1/gcc/Atomic-Builtins.html) * [preshing.com - An Introduction to Lock-Free Programming](https://preshing.com/20120612/an-introduction-to-lock-free-programming/) * [cppreference - atomic](https://en.cppreference.com/w/c/atomic) 相關閱讀 * [Linux 核心設計 - 淺談同步機制](https://hackmd.io/@sysprog/linux-sync?type=view) * [並行程式設計: 概念](https://hackmd.io/@sysprog/concurrency/https%3A%2F%2Fhackmd.io%2F%40sysprog%2FS1AMIFt0D) * [computing.llnl.gov - pthread](https://computing.llnl.gov/tutorials/pthreads/) ::: ## 3 chriso/intern 探討與分析

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.