2021q1 Homework2 (quiz2)

contributed by < linD026 >

tags: `linux2021`

2021 年第 2 週隨堂測驗題目

0 進度

測驗 `1`

1 解釋程式運作原理與可改進

結構定義

此 linked list 是以 queue_t 作為函式之間的主要操作物件。
但在實際進行雙向鏈結串列時的插入與排序等操作，幾乎是以 list_head 為主。
還有一個結構 list_ele_t 為鏈結串列的節點作為儲存節點資訊，目的是為了與鏈結操作分開。
list.h

/**
 * struct list_head - Head and node of a doubly-linked list
 * @prev: pointer to the previous node in the list
 * @next: pointer to the next node in the list
 */
struct list_head {
    struct list_head *prev, *next;
};

merge_sort.c

typedef struct __element {
    char *value;
    struct __element *next;
    struct list_head list;
} list_ele_t;

typedef struct {
    list_ele_t *head; /* Linked list of elements */
    list_ele_t *tail;
    size_t size;
    struct list_head list;
} queue_t;

函式說明

1. get_middle










static list_ele_t *get_middle(struct list_head *list)
{
    struct list_head *fast = list->next, *slow;
    list_for_each (slow, list) {
        if (fast->next == list || fast->next->next == list)
            break;
        fast = fast->next->next;
    }
    return list_entry(slow, list_ele_t, list);
}

對於如何得到環形鏈結串列的中間節點，重點在於第 5 行的判斷。
這其實很好理解，和我之前看的 YouTube: If Programming Was An Anime 有關 (雖然裡面主要講的是 floyd's tortoise and hare，但對可以讓利用速度快慢所造成的差異性理解更清楚)。

只要每次 fast 是 slow 的兩倍走訪節點數，就能確保在達成終止條件 if (fast->next == list || fast->next->next == list) 時 slow 在中間節點。
總結點為奇數

總結點為偶數

2. list_merge

static void list_merge(struct list_head *lhs,
                       struct list_head *rhs,
                       struct list_head *head)
{
    INIT_LIST_HEAD(head);
    if (list_empty(lhs)) {
        list_splice_tail(lhs, head);
        return;
    }
    if (list_empty(rhs)) {
        list_splice_tail(rhs, head);
        return;
    }
    while (!list_empty(lhs) && !list_empty(rhs)) {
        char *lv = list_entry(lhs->next, list_ele_t, list)->value;
        char *rv = list_entry(rhs->next, list_ele_t, list)->value;
        struct list_head *tmp = strcmp(lv, rv) <= 0 ? lhs->next : rhs->next;
        list_del(tmp);
        list_add_tail(tmp, head);
    }
    list_splice_tail(list_empty(lhs) ? rhs : lhs, head);
}

此函式為 merge sort 重新連接的步驟，函式輸入 lhs (left), rhs (right), head (result) 。

當 lhs (left) 或 rhs (right) 其中一方為 emtpy 時，直接把另一方連結到 head 尾端。
否則就每次雙方依序比較節點大小，把較小的值儲存到 tmp 並刪除其在原先位置，隨後連接到 head 尾端，並重複此步驟直到兩者皆為 empty 。

3. list_merge_sort

void list_merge_sort(queue_t *q)
{
    if (list_is_singular(&q->list))
        return;

    queue_t left;
    struct list_head sorted;
    INIT_LIST_HEAD(&left.list);
    list_cut_position(&left.list, &q->list, &get_middle(&q->list)->list);
    list_merge_sort(&left);
    list_merge_sort(q);
    list_merge(&left.list, &q->list, &sorted);
    INIT_LIST_HEAD(&q->list);
    list_splice_tail(&sorted, &q->list);
}

此函式為 merge sort 主要遞迴程式。

先以環形鏈結串列的中間截斷作為分節點，以 list_cut_position 把 q 的左半部(前半段)節點移動到 left 上。
left 和 q 分別進行原函式遞迴，之後再利用 list_merge 合併到 sorted。
最後初始化 q ，把排序完畢的鏈結串列存回去 q 。

Reference

改良結構與連結方式

我認為結構可以改成：

node

typedef struct __element {
    char *value;
    struct list_head list;
} list_ele_t;

head

typedef struct {
    size_t size;
    struct list_head list;
} queue_t;

把 head 和 node 的結構所包含的資訊分開，讓兩者持有自己類別的資訊。

因此關於相關的函式操作需更改成下列形式：

4. q_new

static queue_t *q_new()
{
    queue_t *q = malloc(sizeof(queue_t));
    if (!q) return NULL;
    q->size = 0;
    INIT_LIST_HEAD(&q->list);
    return q;
}

5. q_free

static void q_free(queue_t *q)
{
    if (!q) return;
    struct list_head *item, *is = NULL;
    // list_for_each_safe
    for (item = q->list.next, is = item->next; item != &q->list; item = is, is = item->next) {
        list_ele_t *temp = list_entry(item, list_ele_t, list);
        list_del(&temp->list);
        free(temp->value);
        free(temp);
    }
    free(q);
}

6. q_insert_head

bool q_insert_head(queue_t *q, char *s)
{
    if (!q) return false;

    list_ele_t *new_node = malloc(sizeof(list_ele_t));
    if (!new_node) return false;

    
    new_node->value = strdup(s);
    if (!new_node->value) {
        free(new_node);
        return false;
    }

    q->size++;
    list_add_tail(&new_node->list, &q->list);

    return true;
}

改之前

$ gcc -o test test.c -g
$ valgrind --tool=memcheck --track-origins=yes ./test
==2384== error calling PR_SET_PTRACER, vgdb might block
==2384==
==2384== HEAP SUMMARY:
==2384==     in use at exit: 0 bytes in 0 blocks
==2384==   total heap usage: 163,245 allocs, 163,245 frees, 4,598,645 bytes allocated
==2384==
==2384== All heap blocks were freed -- no leaks are possible
==2384==
==2384== For lists of detected and suppressed errors, rerun with: -s
==2384== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

4,598,645 bytes allocated

改之後

$ gcc -o merge merge_sort.c -g
$ valgrind --tool=memcheck --track-origins=yes ./merge
==2339== error calling PR_SET_PTRACER, vgdb might block
==2339==
==2339== HEAP SUMMARY:
==2339==     in use at exit: 0 bytes in 0 blocks
==2339==   total heap usage: 163,245 allocs, 163,245 frees, 3,945,661 bytes allocated
==2339==
==2339== All heap blocks were freed -- no leaks are possible
==2339==
==2339== For lists of detected and suppressed errors, rerun with: -s
==2339== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

3,945,661 bytes allocated

2 Linux 核心 - lib/list_sort.c

實作可單獨執行 (standalone) 的使用層級應用程式

list_sort.c

完整實作可見: main.c

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#include "list.h"
#include "list_sort.h"

#define len 50
u16 values[len];
int main(int argc, char **argv) {
  // create
  struct list_head testlist;
  arrangement order = ascend;
  struct listitem *item, *is = NULL;
  size_t i;

  u16 seed = (uintptr_t)*argv;
  random_array(values, (u16)ARRAY_SIZE(values), seed);

  INIT_LIST_HEAD(&testlist);

  for (i = 0;i < ARRAY_SIZE(values);i++) {
    item = (struct listitem*)malloc(sizeof(struct listitem));
    item->i = values[i];
    list_add(&item->list, &testlist);
  }
  // sort
  list_sort(&order, &testlist, cmpu16);

  if(check(&testlist, &order, cmpu16))
    printf("in order\n");
  else 
    printf("not in order\n");

  print_check(&testlist);

  // free
  return 0;
}

list.h

完整實作可見: list.h

#ifndef _LINUX_LIST_H
#define _LINUX_LIST_H

#define u8 uint8_t
#define u16 uint16_t
#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect((x), 0)
#define ARRAY_SIZE(x) (sizeof(x) / sizeof(x[0]))

struct list_head {
  struct list_head *next, *prev;
};

struct listitem {
  u16 i;
  struct list_head list;
};

typedef enum {
  ascend,
  descend
} arrangement;

int cmp(void *arrange, const struct list_head  *a,
                      const struct list_head  *b) {
  const u16 ai = (const u16 )container_of(a, struct listitem, list)->i;
  const u16 bi = (const u16 )container_of(b, struct listitem, list)->i;
  const arrangement *arr_temp = (arrangement *)arrange;

  if (*arr_temp == descend)
    return bi - ai;
  else
    return ai - bi;
}

int check(struct list_head *head, void *piv, int (*cmp_func)(void *, const struct list_head *, const struct list_head *)
){
    struct list_head *item, *is = NULL;
    u16 det = true;
    const struct list_head *value = NULL;
    list_for_each_safe(item, is, head){
        if (det) {
            value = item;
            det = false;
        } else {
            if (cmp_func(piv, value, item) > 0 && *(arrangement*)piv == ascend)
                return false;
            value = item;
        }
    }
    return true;
}
#endif

$ gcc -o test list_sort.c -g
$ valgrind --tool=memcheck --track-origins=yes ./test
==7455== error calling PR_SET_PTRACER, vgdb might block
in order
2 8 12 27 40 42 64 96 102 121 144 192 198 223 264 283 332 356 390 501
==7455==
==7455== HEAP SUMMARY:
==7455==     in use at exit: 480 bytes in 20 blocks
==7455==   total heap usage: 21 allocs, 1 frees, 4,576 bytes allocated
==7455==
==7455== LEAK SUMMARY:
==7455==    definitely lost: 24 bytes in 1 blocks
==7455==    indirectly lost: 456 bytes in 19 blocks
==7455==      possibly lost: 0 bytes in 0 blocks
==7455==    still reachable: 0 bytes in 0 blocks
==7455==         suppressed: 0 bytes in 0 blocks
==7455== Rerun with --leak-check=full to see details of leaked memory
==7455==
==7455== For lists of detected and suppressed errors, rerun with: -s
==7455== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Reference

stackoverflow - Is attribute((nonnull)) is standard C
arm keil - attribute((nonnull))) function attribute
bootlin.com - linux
likely, unlikely 巨集與 GCC 內建函式 __builtin_expect
Linux 原始程式碼 (GitHub)
include/linux/list.h

設計檢測

analysis.c

完整實作可見: analysis.c

void analysis(void) {
    FILE *ptr = NULL;
    ptr = fopen("list_sort.txt", "w"); if(!ptr) return;
    printf("len:%d  time:%d\n", len, times);

    arrangement order = ascend;
    struct list_head testlist;
    INIT_LIST_HEAD(&testlist);
    struct listitem *item, *is = NULL;
    size_t i;
    for (i = 0;i < len;i++) {
        item = (struct listitem*)malloc(sizeof(struct listitem));
        list_add(&item->list, &testlist);
    }

    struct timespec time_start;
    struct timespec time_end;
    double during;
    for (int time_i = 0;time_i < times;time_i++) {
        //printf("%d\n", time_i);
        list_for_each_entry_safe(item, is, &testlist, list) {
            item->i = set_rand();
        }
        
        clock_gettime(CLOCK_MONOTONIC, &time_start);
        list_sort(&order, &testlist, cmpu16);
        clock_gettime(CLOCK_MONOTONIC, &time_end);  
        during = time_diff(time_start, time_end);
        if(check(&testlist, &order, cmpu16)) {
          printf("%d check\n", time_i);
          fprintf(ptr, "%d %f\n" , time_i, during);
        }
    }
    fclose(ptr);
}

說明 lib/list_sort.c 最佳化手法

temp note :

/lib/list_sort.c










/*
 * This mergesort is as eager as possible while always performing at least
 * 2:1 balanced merges.  Given two pending sublists of size 2^k, they are
 * merged to a size-2^(k+1) list as soon as we have 2^k following elements.
 *
 * Thus, it will avoid cache thrashing as long as 3*2^k elements can
 * fit into the cache.  Not quite as good as a fully-eager bottom-up
 * mergesort, but it does use 0.2*n fewer comparisons, so is faster in
 * the common case that everything fits into L1.
 */

根據註解此 mergesort 會盡可能地以 2:1 合併，因此當有兩個

2^{k}

大小的串列會合併成

2^{k + 1}

大小的串列來達到避免 cache thrashing 直到

3 \times 2^{k}

個元素至 cache 。
再者雖然 mergesort 沒有 fully-eager bottom-up mergesort 好，但用較少的比較 (

0.2 \times n

)，因此只要在一般情況下跟 L1 合適就比較快。

L1 為 CPU 的 cache 。可以下 getconf -a | grep CACHE 查看。

list_sort( )
















/*
 * The merging is controlled by "count", the number of elements in the
 * pending lists.  This is beautiully simple code, but rather subtle.
 *
 * Each time we increment "count", we set one bit (bit k) and clear
 * bits k-1 .. 0.  Each time this happens (except the very first time
 * for each bit, when count increments to 2^k), we merge two lists of
 * size 2^k into one list of size 2^(k+1).
 *  * This merge happens exactly when the count reaches an odd multiple of
 * 2^k, which is when we have 2^k elements pending in smaller lists,
 * so it's safe to merge away two lists of size 2^k.
 *
 * After this happens twice, we have created two lists of size 2^(k+1),
 * which will be merged into a list of size 2^(k+2) before we create
 * a third list of size 2^(k+1), so there are never more than two pending.
 */

解釋分離與合併時的數量是如何符合最佳化。
















 /* The number of pending lists of size 2^k is determined by the
 * state of bit k of "count" plus two extra pieces of information:
 *
 * - The state of bit k-1 (when k == 0, consider bit -1 always set), and
 * - Whether the higher-order bits are zero or non-zero (i.e.
 *   is count >= 2^(k+1)).
 *
 * There are six states we distinguish.  "x" represents some arbitrary
 * bits, and "y" represents some arbitrary non-zero bits:
 * 0:  00x: 0 pending of size 2^k;           x pending of sizes < 2^k
 * 1:  01x: 0 pending of size 2^k; 2^(k-1) + x pending of sizes < 2^k
 * 2: x10x: 0 pending of size 2^k; 2^k     + x pending of sizes < 2^k
 * 3: x11x: 1 pending of size 2^k; 2^(k-1) + x pending of sizes < 2^k
 * 4: y00x: 1 pending of size 2^k; 2^k     + x pending of sizes < 2^k
 * 5: y01x: 2 pending of size 2^k; 2^(k-1) + x pending of sizes < 2^k
 * (merge and loop back to state 2) */

說明對於數量是如掌控與操作。




















































{
	struct list_head *list = head->next, *pending = NULL;
	size_t count = 0;	/* Count of pending */

	if (list == head->prev)	/* Zero or one elements */
		return;

	/* Convert to a null-terminated singly-linked list. */
	head->prev->next = NULL;

	/*
	 * Data structure invariants:
	 * - All lists are singly linked and null-terminated; prev
	 *   pointers are not maintained.
	 * - pending is a prev-linked "list of lists" of sorted
	 *   sublists awaiting further merging.
	 * - Each of the sorted sublists is power-of-two in size.
	 * - Sublists are sorted by size and age, smallest & newest at front.
	 * - There are zero to two sublists of each size.
	 * - A pair of pending sublists are merged as soon as the number
	 *   of following pending elements equals their size (i.e.
	 *   each time count reaches an odd multiple of that size).
	 *   That ensures each later final merge will be at worst 2:1.
	 * - Each round consists of:
	 *   - Merging the two sublists selected by the highest bit
	 *     which flips when count is incremented, and
	 *   - Adding an element from the input as a size-1 sublist.
	 */
	do {
		size_t bits;
		struct list_head **tail = &pending;

		/* Find the least-significant clear bit in count */
		for (bits = count; bits & 1; bits >>= 1)
			tail = &(*tail)->prev;
		/* Do the indicated merge */
		if (likely(bits)) {
			struct list_head *a = *tail, *b = a->prev;

			a = merge(priv, (cmp_func)cmp, b, a);
			/* Install the merged result in place of the inputs */
			a->prev = b->prev;
			*tail = a;
		}

		/* Move one element from input list to pending */
		list->prev = pending;
		pending = list;
		list = list->next;
		pending->next = NULL;
		count++;
	} while (list);

count 為 1 ( least-significant clear bit )

會讓 tail 往右側移一位到 head 的位置。

關於 bit 如何操作請看第 158 行。

下圖為 list_sort( ) 192 到 242 行圖解以 count 為 6 時的示意圖。

假設在 list_sort() 裡的 b 先排，則 tail 會在 a 上 :













	/* End of input; merge together all the pending lists. */
	list = pending;
	pending = pending->prev;
	for (;;) {
		struct list_head *next = pending->prev;

		if (!next)
			break;
		list = merge(priv, (cmp_func)cmp, pending, list);
		pending = next;
	}
	/* The final merge, rebuilding prev links */
	merge_final(priv, (cmp_func)cmp, head, pending, list);

最後再把剩下的 pending 合併排序即可。

Reference

3 效能比較

tree sort vs. merge sort

以 10000 節點與 2000 次的重複執行，紀錄 tree-sort.c ( quiz1 ) 和 list_sort.c ( Linux ) 的動態記憶體配置與靜態 stack 所站的空間。
兩個程式排序的數值皆為 uint16_t 單位，且由以下程式碼給出：

static inline int self_random(int seed_f, int seed_s) {
  int sn_1 = 0;
  int sn_2 = 1;
  int sn = random() % 1024;
  int i = seed_f;
  while (i > 0) {
    sn = (sn_2 & sn_1) % (seed_s + 1);
    sn_1 = sn_1 + 3;
    sn_2 = sn_2 + 7;
    i--;
  }
  return sn;
}

uint16_t set_rand(void) {
  time_t current_time;
  srandom(time(&current_time));
  return self_random(random() % 1024, random() % 1024);
}

第一筆資料是以 gcc 裡的 -fstack-usage 得到，中間數字為 stack usage (單位: Byte)。
第二筆資料則是 valgrind -v --leak-check=full 。
第三筆為靜態資料，利用命令 size 。
第四筆則為 gprof 。

關於完整 analysis.txt 資料請點擊函式標題。

tree-sort.c

$ gcc -o test -I./include -I./private tests/tree-sort.c -g -fstack-usage
list.h:83:20:INIT_LIST_HEAD                             16      static
list.h:109:20:list_add_tail                             16      static
list.h:135:20:list_del                                  16      static
rbtree.h:69:20:rb_link_node                             16      static
rbtree.h:154:17:rb_first                                16      static
rbtree.h:166:17:rb_next                                 16      static
rb_tree_augmented.h:115:20:rb_set_parent_color          16      static
rb_tree_augmented.h:122:20:__rb_change_child            16      static
rb_tree_augmented.h:305:31:rb_red_parent                16      static
rb_tree_augmented.h:316:1:__rb_rotate_set_parents       64      static
rb_tree_augmented.h:463:20:dummy_rotate                 16      static
rb_tree_augmented.h:466:6:rb_insert_color               80      static
common.h:16:23:getnum                                   16      static
common.h:34:17:get_unsigned16                           32      static
tree-sort.c:17:13:insert                                80      static
tree-sort.c:34:13:tree_sort                             192     dynamic
tree-sort.c:133:19:self_random                          48      static
tree-sort.c:147:10:set_rand                             48      static
tree-sort.c:156:6:analysis                              160     static
tree-sort.c:189:5:main                                  32      static

$ valgrind -v --leak-check=full ./test
--1341-- REDIR: 0x48f6850 (libc.so.6:free) redirected to 0x483c9d0 (free)
==1341==
==1341== HEAP SUMMARY:
==1341==     in use at exit: 240,000 bytes in 10,000 blocks
==1341==   total heap usage: 10,003 allocs, 3 frees, 248,664 bytes allocated
==1341==
==1341== Searching for pointers to 10,000 not-freed blocks
==1341== Checked 74,168 bytes

$ size test
   text    data     bss     dec     hex filename
   6866     678       2    7546    1d7a test

$ gcc -pg -o tree_sort -I./include -I./private tests/tree-sort.c -g -fstack-usage
$ ./tree_sort
$ gprof tree_sort gmon.out > analysis.txt
Each sample counts as 0.01 seconds.
 no time accumulated
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  Ts/call  Ts/call  name    
  0.00      0.00     0.00 89960398     0.00     0.00  rb_set_parent_color
  0.00      0.00     0.00 59893329     0.00     0.00  rb_red_parent
  0.00      0.00     0.00 20089752     0.00     0.00  dummy_rotate
  0.00      0.00     0.00 20010000     0.00     0.00  list_add_tail
  0.00      0.00     0.00 20002001     0.00     0.00  INIT_LIST_HEAD
  0.00      0.00     0.00 20000000     0.00     0.00  insert
  0.00      0.00     0.00 20000000     0.00     0.00  list_del
  0.00      0.00     0.00 20000000     0.00     0.00  rb_insert_color
  0.00      0.00     0.00 20000000     0.00     0.00  rb_link_node
  0.00      0.00     0.00 20000000     0.00     0.00  rb_next
  0.00      0.00     0.00 20000000     0.00     0.00  self_random
  0.00      0.00     0.00 20000000     0.00     0.00  set_rand
  0.00      0.00     0.00 19951723     0.00     0.00  __rb_change_child
  0.00      0.00     0.00 19951723     0.00     0.00  __rb_rotate_set_parents
  0.00      0.00     0.00     2000     0.00     0.00  rb_first
  0.00      0.00     0.00     2000     0.00     0.00  tree_sort
  0.00      0.00     0.00        1     0.00     0.00  analysis

list_sort.c ( + analysis.c )

$ gcc -o test list_sort.c analysis.c -g -fstack-usage
list_sort.c:17:1:merge          80              static
list_sort.c:44:1:merge_final    80              static
list_sort.c:176:1:list_sort     128             static
list.h:37:20:INIT_LIST_HEAD     16              static
list.h:42:20:list_add           16              static
list.h:109:19:cmpu16            16              static
list.h:122:19:self_random       48              static
analysis.c:10:5:set_rand        48              static
analysis.c:18:6:analysis        160             static
analysis.c:54:5:main            32              static

$ valgrind -v --leak-check=full ./test
--377-- REDIR: 0x48f6850 (libc.so.6:free) redirected to 0x483c9d0 (free)
==377==
==377== HEAP SUMMARY:
==377==     in use at exit: 240,000 bytes in 10,000 blocks
==377==   total heap usage: 10,003 allocs, 3 frees, 248,664 bytes allocated
==377==
==377== Searching for pointers to 10,000 not-freed blocks
==377== Checked 74,152 bytes

$ size test
   text    data     bss     dec     hex filename
   4496     656       8    5160    1428 test

$ gcc -pg -o list_sort list_sort.c analysis.c -g -fstack-usage
$ ./list_sort 
$ gprof list_sort gmon.out > analysis.txt
Each sample counts as 0.01 seconds.
 no time accumulated
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  Ts/call  Ts/call  name    
  0.00      0.00     0.00 130915357     0.00     0.00  cmpu16
  0.00      0.00     0.00 20000000     0.00     0.00  self_random
  0.00      0.00     0.00 20000000     0.00     0.00  set_rand
  0.00      0.00     0.00 19996000     0.00     0.00  merge
  0.00      0.00     0.00    10000     0.00     0.00  list_add
  0.00      0.00     0.00     2000     0.00     0.00  list_sort
  0.00      0.00     0.00     2000     0.00     0.00  merge_final
  0.00      0.00     0.00        1     0.00     0.00  INIT_LIST_HEAD
  0.00      0.00     0.00        1     0.00     0.00  analysis

	tree-sort.c	list_sort.c
total heap usage	10,003 allocs 3 frees 248,664 bytes allocated	10,003 allocs 3 frees 248,664 bytes allocated
main function	192 dynamic (tree_sort)	128 static (list_sort)

可以看出雖然 list-sort cmpu16 高達 130915357 比其他函式呼叫的多，但在從時間花費上可以看出其所造成的成本並不比 tree sort 其他函式呼叫累積起來的成本高。
然而上述資料所呈現的記憶體使用量沒有執行時期的紀錄，因此我開一個執行緒利用 #include <sys/resource.h> 裡的 struct rusage 來記錄每 1ms 記憶體的使用量，但這次節點數為 10000 、次數 100 次：

詳細程式碼請看 dtusage.h 。
可以明顯看出 tree sort 的記憶體使用量確實較大，這點我認為是因為 tree sort 在排序時會需要建立 BST 在與 list-sort ( merge sort )相比產生較多的額外記憶體空間。
經老師提出：

為何不用 valgrind –tool=massif 呢？

沒想到 valgrind 有這功能因此試了一下，只貼上記憶體圖形化與 stack 使用率較為突出的部分：

註：tree-sort.c 有較高的 stack 使用量，然而list-sort.c 除了 n = 1 比較高外都皆為 480 。

關於此問題請看 n = 1 stack 使用量大問題。

tree-sort.c


    KB
712.4^                :   @                                          @
     |#::           ::: : @                                        ::@       :
     |#::           : : : @                                        ::@       :
     |#::           : : : @                                        ::@       :
     |#::           : : : @                                        ::@       :
     |#::           : : : @                                        ::@       :
     |#::           : : : @                                        ::@       :
     |#::           : : : @                                        ::@       :
     |#::           : : : @                                        ::@       :
     |#:::::::::::::: @:::@::::::::::::@:::::::@:::@@:::@:::::@::::::@::::::@:
     |#:::::: ::::::: @:::@::::::: ::::@:::::::@:::@ :::@:::::@::::::@::::::@:
     |#:::::: ::::::: @:::@::::::: ::::@:::::::@:::@ :::@:::::@::::::@::::::@:
     |#:::::: ::::::: @:::@::::::: ::::@:::::::@:::@ :::@:::::@::::::@::::::@:
     |#:::::: ::::::: @:::@::::::: ::::@:::::::@:::@ :::@:::::@::::::@::::::@:
     |#:::::: ::::::: @:::@::::::: ::::@:::::::@:::@ :::@:::::@::::::@::::::@:
     |#:::::: ::::::: @:::@::::::: ::::@:::::::@:::@ :::@:::::@::::::@::::::@:
     |#:::::: ::::::: @:::@::::::: ::::@:::::::@:::@ :::@:::::@::::::@::::::@:
     |#:::::: ::::::: @:::@::::::: ::::@:::::::@:::@ :::@:::::@::::::@::::::@:
     |#:::::: ::::::: @:::@::::::: ::::@:::::::@:::@ :::@:::::@::::::@::::::@:
     |#:::::: ::::::: @:::@::::::: ::::@:::::::@:::@ :::@:::::@::::::@::::::@:
   0 +----------------------------------------------------------------------->Gi
     0                                                                   12.54
Number of snapshots: 84
 Detailed snapshots: [1 (peak), 16, 20, 33, 42, 46, 50, 60, 70, 80]
--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
  0              0                0                0             0            0
  1    140,839,206          725,216          244,568       160,024      320,624
33.72% (244,568B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->33.09% (240,000B) 0x109F13: analysis (tree-sort.c:168)
| ->33.09% (240,000B) 0x10A0C2: main (tree-sort.c:196)
|
->00.63% (4,568B) in 1+ places, all below ms_print's threshold (01.00%)

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
  2    293,986,963          729,392          248,664       160,032      320,696
  3    442,401,488          729,392          248,664       160,032      320,696
  4    605,555,715          409,176          248,664       160,032          480
  5    770,853,878          409,176          248,664       160,032          480
  6  1,049,488,316          409,176          248,664       160,032          480
  7  1,228,610,437          409,176          248,664       160,032          480
  8  1,504,261,304          409,176          248,664       160,032          480
  9  1,779,934,462          409,176          248,664       160,032          480
 10  1,926,957,892          409,176          248,664       160,032          480
 11  2,197,708,686          409,176          248,664       160,032          480
 12  2,398,560,278          409,176          248,664       160,032          480
 13  2,549,199,492          409,176          248,664       160,032          480
 14  2,750,050,755          729,392          248,664       160,032      320,696
 15  3,026,222,400          729,472          248,664       160,032      320,776
 16  3,176,865,118          409,176          248,664       160,032          480
60.77% (248,664B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->58.65% (240,000B) 0x109F13: analysis (tree-sort.c:168)
| ->58.65% (240,000B) 0x10A0C2: main (tree-sort.c:196)

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 61 10,728,706,661          409,168          248,664       160,032          472
 62 10,852,877,952          409,176          248,664       160,032          480
 63 10,977,050,052          409,176          248,664       160,032          480
 64 11,101,221,318          409,176          248,664       160,032          480
 65 11,225,393,418          409,176          248,664       160,032          480
 66 11,349,558,940          409,168          248,664       160,032          472
 67 11,473,724,463          729,312          248,664       160,032      320,616
 68 11,597,890,448          409,176          248,664       160,032          480
 69 11,722,055,973          729,400          248,664       160,032      320,704
 70 11,846,221,495          729,472          248,664       160,032      320,776
34.09% (248,664B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->32.90% (240,000B) 0x109F13: analysis (tree-sort.c:168)
| ->32.90% (240,000B) 0x10A0C2: main (tree-sort.c:196)

list-sort.c (+ analysis.c )

    KB
401.5^#
     |#:::::::::::::::::::::::@::@::::::::::::@:::::::@:::::@::::@:::::@:::::@
     |#:::::::::::::::::::::::@: @::::::::::::@:::::::@:::::@::::@:::::@:::::@
     |#:::::::::::::::::::::::@: @::::::::::::@:::::::@:::::@::::@:::::@:::::@
     |#:::::::::::::::::::::::@: @::::::::::::@:::::::@:::::@::::@:::::@:::::@
     |#:::::::::::::::::::::::@: @::::::::::::@:::::::@:::::@::::@:::::@:::::@
     |#:::::::::::::::::::::::@: @::::::::::::@:::::::@:::::@::::@:::::@:::::@
     |#:::::::::::::::::::::::@: @::::::::::::@:::::::@:::::@::::@:::::@:::::@
     |#:::::::::::::::::::::::@: @::::::::::::@:::::::@:::::@::::@:::::@:::::@
     |#:::::::::::::::::::::::@: @::::::::::::@:::::::@:::::@::::@:::::@:::::@
     |#:::::::::::::::::::::::@: @::::::::::::@:::::::@:::::@::::@:::::@:::::@
     |#:::::::::::::::::::::::@: @::::::::::::@:::::::@:::::@::::@:::::@:::::@
     |#:::::::::::::::::::::::@: @::::::::::::@:::::::@:::::@::::@:::::@:::::@
     |#:::::::::::::::::::::::@: @::::::::::::@:::::::@:::::@::::@:::::@:::::@
     |#:::::::::::::::::::::::@: @::::::::::::@:::::::@:::::@::::@:::::@:::::@
     |#:::::::::::::::::::::::@: @::::::::::::@:::::::@:::::@::::@:::::@:::::@
     |#:::::::::::::::::::::::@: @::::::::::::@:::::::@:::::@::::@:::::@:::::@
     |#:::::::::::::::::::::::@: @::::::::::::@:::::::@:::::@::::@:::::@:::::@
     |#:::::::::::::::::::::::@: @::::::::::::@:::::::@:::::@::::@:::::@:::::@
     |#:::::::::::::::::::::::@: @::::::::::::@:::::::@:::::@::::@:::::@:::::@
   0 +----------------------------------------------------------------------->Gi
     0                                                                   12.62
Number of snapshots: 97
 Detailed snapshots: [1 (peak), 28, 30, 44, 55, 65, 75, 85, 95]
--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
  0              0                0                0             0            0
  1    172,168,618          411,160          248,664       160,032        2,464
60.48% (248,664B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->58.37% (240,000B) 0x109868: analysis (analysis.c:30)
| ->58.37% (240,000B) 0x109A23: main (analysis.c:62)

--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 96 13,553,440,562          409,176          248,664       160,032          480

n = 1 stack 使用量大問題

經反覆測試與檢測初始化所造成的成本發現， n = 1 會提高的原因是因為 analysis.c 的 linked list 是先配置完記憶體後，進入數測量次數的迴圈再走訪一次設定亂數。
會這麼做的原因是因為程式比較好寫，不須寫例外狀況(次數為一)。因此第一次會因為呼叫 list_add 與設定鑑測資料(如時間)而導致 n = 1成本變高。
以下是函式簡化後：

analysis.c (簡化)

#define len 1000000
#define times 1
int main(void) {
    printf("len:%d  time:%d\n", len, times);
    arrangement order = ascend;
    struct list_head testlist;
    INIT_LIST_HEAD(&testlist);
    struct listitem *item, *is = NULL;
    size_t i;
    for (i = 0;i < len;i++) {
        item = (struct listitem*)malloc(sizeof(struct listitem));
        list_add(&item->list, &testlist);
    }
    for (int time_i = 0;time_i < times;time_i++) {
        list_for_each_entry_safe(item, is, &testlist, list) {
            item->i = set_rand();
        }
        list_sort(&order, &testlist, cmpu16);
    }
}

    MB
38.15^                                                                    ::::
     |##::::::::::::::::::::::::::::::::::::::::@::::@::::::::::::::::@:::::::
     |# :: ::::::: :::: ::::: :::: :::: : : ::::@::::@::::::::::::::::@:::::::
     |# :: ::::::: :::: ::::: :::: :::: : : ::::@::::@::::::::::::::::@:::::::
     |# :: ::::::: :::: ::::: :::: :::: : : ::::@::::@::::::::::::::::@:::::::
     |# :: ::::::: :::: ::::: :::: :::: : : ::::@::::@::::::::::::::::@:::::::
     |# :: ::::::: :::: ::::: :::: :::: : : ::::@::::@::::::::::::::::@:::::::
     |# :: ::::::: :::: ::::: :::: :::: : : ::::@::::@::::::::::::::::@:::::::
     |# :: ::::::: :::: ::::: :::: :::: : : ::::@::::@::::::::::::::::@:::::::
     |# :: ::::::: :::: ::::: :::: :::: : : ::::@::::@::::::::::::::::@:::::::
     |# :: ::::::: :::: ::::: :::: :::: : : ::::@::::@::::::::::::::::@:::::::
     |# :: ::::::: :::: ::::: :::: :::: : : ::::@::::@::::::::::::::::@:::::::
     |# :: ::::::: :::: ::::: :::: :::: : : ::::@::::@::::::::::::::::@:::::::
     |# :: ::::::: :::: ::::: :::: :::: : : ::::@::::@::::::::::::::::@:::::::
     |# :: ::::::: :::: ::::: :::: :::: : : ::::@::::@::::::::::::::::@:::::::
     |# :: ::::::: :::: ::::: :::: :::: : : ::::@::::@::::::::::::::::@:::::::
     |# :: ::::::: :::: ::::: :::: :::: : : ::::@::::@::::::::::::::::@:::::::
     |# :: ::::::: :::: ::::: :::: :::: : : ::::@::::@::::::::::::::::@:::::::
     |# :: ::::::: :::: ::::: :::: :::: : : ::::@::::@::::::::::::::::@:::::::
     |# :: ::::::: :::: ::::: :::: :::: : : ::::@::::@::::::::::::::::@:::::::
   0 +----------------------------------------------------------------------->Gi
     0                                                                   10.50
--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
  0              0                0                0             0            0
  1     63,543,365       39,635,968       23,781,736    15,853,816          416
  2    326,840,523       40,001,432       24,001,024    16,000,008          400
 50  9,180,465,501       40,001,432       24,001,024    16,000,008          400
 67 11,273,922,938       40,001,584       24,001,024    16,000,008          552

main.c ( 簡化 )

#define len 1000000
u16 values[len];
int main(int argc, char **argv) {
  // create
  struct list_head testlist;
  arrangement order = ascend;
  struct listitem *item, *is = NULL;
  size_t i;
  u16 seed = (uintptr_t)*argv;
  INIT_LIST_HEAD(&testlist);
  random_u16(&testlist, len, seed);
  // sort
  list_sort(&order, &testlist, cmpu16);
  if(check(&testlist, &order, cmpu16))
    printf("in order\n");
  else 
    printf("not in order\n");
  // free
  return 0;
}

    MB
38.15^                                                               ::   : : 
     |                                                           @@#:::::::@::
     |                                                        @@@@@#:::::::@::
     |                                                     @@@@@@@@#:::::::@::
     |                                                  :::@@@@@@@@#:::::::@::
     |                                               @@::::@@@@@@@@#:::::::@::
     |                                           @@@@@@::::@@@@@@@@#:::::::@::
     |                                         @@@@@@@@::::@@@@@@@@#:::::::@::
     |                                      @@@@@@@@@@@::::@@@@@@@@#:::::::@::
     |                                  ::@@@@ @@@@@@@@::::@@@@@@@@#:::::::@::
     |                               @@:::@ @@ @@@@@@@@::::@@@@@@@@#:::::::@::
     |                             @@@ :::@ @@ @@@@@@@@::::@@@@@@@@#:::::::@::
     |                         @@@@@@@ :::@ @@ @@@@@@@@::::@@@@@@@@#:::::::@::
     |                      @:@@@@ @@@ :::@ @@ @@@@@@@@::::@@@@@@@@#:::::::@::
     |                   @@@@:@@@@ @@@ :::@ @@ @@@@@@@@::::@@@@@@@@#:::::::@::
     |                @@@@@@@:@@@@ @@@ :::@ @@ @@@@@@@@::::@@@@@@@@#:::::::@::
     |             @@:@@@@@@@:@@@@ @@@ :::@ @@ @@@@@@@@::::@@@@@@@@#:::::::@::
     |         @@:@@@:@@@@@@@:@@@@ @@@ :::@ @@ @@@@@@@@::::@@@@@@@@#:::::::@::
     |       @@@@:@@@:@@@@@@@:@@@@ @@@ :::@ @@ @@@@@@@@::::@@@@@@@@#:::::::@::
     |    @@@@@@@:@@@:@@@@@@@:@@@@ @@@ :::@ @@ @@@@@@@@::::@@@@@@@@#:::::::@::
   0 +----------------------------------------------------------------------->Gi
     0                                                                   6.900
 --------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
  0              0                0                0             0            0
  1    142,377,837          887,104          531,984       354,656          464
  2    297,677,838        1,854,024        1,112,136       741,424          464
 50  5,602,886,870       34,932,744       20,959,368    13,972,912          464
 77  7,409,064,283       40,000,520       24,000,000    16,000,000          520

可以看出 stack 被壓縮到先前運行中段時的成本 480 。

Reference

測驗 `2`

1 解釋程式運作原理

考慮函式 func 接受一個 16 位元無號整數
$N$ ，並回傳小於或等於
$N$ 的 power-of-2 (漢字可寫作為 2 的冪)









uint16_t func(uint16_t N) {
    /* change all right side bits to 1 */
    N |= N >> 1;
    N |= N >> 2;
    N |= N >> 4;
    N |= N >> 8;

    return (N + 1) >> 1;
}

example: N =
$10101_{2}$ =
$21_{10}$
first

N |= N >> 1;

先省略 4 到 5 行，因為這裡還用不到。

return (N + 1) >> 1;

最後會回傳 16 (0001_0000) 。
可以理解成在進行 3 到 6 行的右移是為了保證最高是 1 的位元以下(以後)都是 1 。
之後對其進行加一進位，隨後在往右移一位即得小於或等於輸入數值的 power-of-2。

C11 6.3 Conversions

但這就有問題了，如果是 65535 (1111_1111_1111_1111) 預期結果會是什麼？
如果是以平常寫程式的經驗推論，會是在 3 到 6 行都是與輸入一樣後，加 1 使其 overflow 為 [1]_0000_0000_0000 再 >> 1 。

應該等於 0 吧。

然而實測卻是這樣：

$ cat main.c
#include <stdio.h>
#include <stdint.h>
uint16_t func(uint16_t N) {
    /* change all right side bits to 1 */
    N |= N >> 1;
    N |= N >> 2;
    N |= N >> 4;
    N |= N >> 8;
    return (N + 1) >> 1;
}

int main(){
    uint16_t num = 65535;
    printf("%%d  %d %d\n", num, func(num));
    printf("%%u  %d %u\n", num, func(num));
    printf("%%hu %d %hu\n", num, func(num));
}
$ gcc -o test main.c
$ ./test
%d  65535 32768
%u  65535 32768
%hu 65535 32768

可能會覺得 printf 的型態 ( %d ) 是 int 非 uint16_t ( unsigned short )，但這並不能解釋傳輸的值為 .. 1000 0000 0000 0000 。 ( 已補上各個型態輸出 )

因此閱讀了 C11 N2310 ISO/IEC 9899:~~2017~~2x(E) 的 6.3 Conversions 後有了些想法：
以下若非特別註明皆引述自 C11 N2310 ISO/IEC 9899:~~2017~~2x(E)。

6.3

1 Several operators convert operand values from one type to another automatically. This subclause specifies the result required from such an implicit conversion, as well as those that result from a cast operation (an explicit conversion).

數個 operators 會自動地對 operand values 進行型態轉換。

6.3.1.1 Boolean, characters, and integers - 1

— The rank of a signed integer type shall be greater than the rank of any signed integer type with less precision.
— The rank of long long int shall be greater than the rank of long int, which shall be greater than the rank of int, which shall be greater than the rank of short int, which shall be greater than the rank of signed char.
— The rank of any unsigned integer type shall equal the rank of the corresponding signed integer type, if any.

對各個型態的 integer 之間的 rank 有明確規範。

the rank of integer > the rank of short integer
the rank of unsigned int == the rank of the corresponding signed integer
The rank of signed integer > the rank of any signed integer type with less precision

6.3.1.3 Signed and unsigned integers

1 When a value with integer type is converted to another integer type other than _Bool, if the value can be represented by the new type, it is unchanged.
2 Otherwise, if the new type is unsigned, the value is converted by repeatedly adding or subtracting one more than the maximum value that can be represented in the new type until the value is in the range of the new type
$.^{61}$ )

一個 integer type 再轉換到另一個 integer type 時，若可表示則不變。

6.3.1.8 Usual arithmetic conversions

1 Many operators that expect operands of arithmetic type cause conversions and yield result types in a similar way. The purpose is to determine a common real type for the operands and result. For the specified operands, each operand is converted, without change of type domain, to a type whose corresponding real type is the common real type. Unless explicitly stated otherwise, the common real type is also the corresponding real type of the result, whose type domain is the type domain of the operands if they are the same, and complex otherwise. This pattern is called the usual arithmetic conversions :
…
Otherwise, if the operand that has unsigned integer type has rank greater or equal to the rank of the type of the other operand, then the operand with signed integer type is converted to the type of the operand with unsigned integer type.
Otherwise, if the type of the operand with signed integer type can represent all of the values of the type of the operand with unsigned integer type, then the operand with unsigned integer type is converted to the type of the operand with signed integer type.

如果 unsigned integer 的 rank 大於等於另一個整數型態，另個會被轉換成 unsigned integer 。
如果 signed integer type 可以表示所有 unsigned integer type 的值則 unsigned integer type 會被轉換成 signed integer type

6.4.4.1 Integer constants - Semantics

5 The type of an integer constant is the first of the corresponding list in which its value can be represented.

可以知道 1 會是 int 型態，亦即 int32_t 。

最後可以總結出：

return (N + 1) >> 1;

型態為 uint16_t 的變數 N 會因為 1 轉成有號整數 (int32_t) 型態，再做 right shift 。

0000 0000 0000 0001 _ 0000 0000 0000 0000

而關於 shift，規格書也有規範到 (6.5.7 Bitwise shift operators) ：

5 The result of
$E 1$ >>
$E 2$ is
$E 1$ right-shifted
$E 2$ bit positions. If
$E 1$ has an unsigned type or if
$E 1$ has asigned type and a nonnegative value, the value of the result is the integral part of the quotient of
$E 1 / 2^{E 2}$ . If E1 has a signed type and a negative value, the resulting value is implementation-defined.

0000 0000 0000 0000 _ 1000 0000 0000 0000 => 32768

最後再以 uint16_t 的 32768 (1000 0000 0000 0000) 傳出函式。

2 is_power_of_2

說明 is_power_of_2 原理

kernel api - is_power_of_2
bool is_power_of_2(unsigned long n)

Determine whether some value is a power of two, where zero is not considered a power of two.

include/linux/log2.h

/**
 * is_power_of_2() - check if a value is a power of two
 * @n: the value to check
 *
 * Determine whether some value is a power of two, where zero is
 * *not* considered a power of two.
 * Return: true if @n is a power of 2, otherwise false.
 */
static inline __attribute__((const))
bool is_power_of_2(unsigned long n)
{
	return (n != 0 && ((n & (n - 1)) == 0));
}

確認 n 不為零，且在與自己減一過後的值 AND 後為零的話回傳 true 。
(1000 & 0111) == 0 => true
(1111 & 1110) == 0 => false

/**
 * __roundup_pow_of_two() - round up to nearest power of two
 * @n: value to round up
 */
static inline __attribute__((const))
unsigned long __roundup_pow_of_two(unsigned long n)
{
	return 1UL << fls_long(n - 1);
}

/**
 * __rounddown_pow_of_two() - round down to nearest power of two
 * @n: value to round down
 */
static inline __attribute__((const))
unsigned long __rounddown_pow_of_two(unsigned long n)
{
	return 1UL << (fls_long(n) - 1);
}

1UL 為 unsigned long 型的 1 。

static inline unsigned fls_long(unsigned long l)
{
	if (sizeof(l) == 4)
		return fls(l);
	return fls64(l);
}

會區分 32 與 64 位元。

而繼續追 fls64 會查到是在 arch 目錄下：

arch/alpha/include/asm/bitops.h, line 369 (as a function)
arch/alpha/include/asm/bitops.h, line 376 (as a function)
arch/powerpc/include/asm/bitops.h, line 235 (as a function)
arch/s390/include/asm/bitops.h, line 395 (as a function)
arch/x86/include/asm/bitops.h, line 366 (as a function)

可以看到會因各個架構而有不同定義，在 x86 底下： Linux kernel source tree 。
如果查看 alpha 會是：

































/*
 * fls: find last bit set.
 */
#if defined(CONFIG_ALPHA_EV6) && defined(CONFIG_ALPHA_EV67)
static inline int fls64(unsigned long word)
{
	return 64 - __kernel_ctlz(word);
}
#else
extern const unsigned char __flsm1_tab[256];

static inline int fls64(unsigned long x)
{
	unsigned long t, a, r;

	t = __kernel_cmpbge (x, 0x0101010101010101UL);
	a = __flsm1_tab[t];
	t = __kernel_extbl (x, a);
	r = a*8 + __flsm1_tab[t] + (x != 0);

	return r;
}
#endif

static inline unsigned long __fls(unsigned long x)
{
	return fls64(x) - 1;
}

static inline int fls(unsigned int x)
{
	return fls64(x);
}

#  define __kernel_ctlz(x)		__builtin_clzl(x)

Built-in Function: int __builtin_clzl (unsigned long)
Similar to __builtin_clz, except the argument type is unsigned long.

而 __builtin_clz 則是計算從 most significant bit 開始有幾個 0 ，亦即在到第一個 1 之前有幾個 0 。

舉例
在 alpha 下， n = 139,611,588,448,485,376
( 0000 0001 1111 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 )
- roundup_pow_of_two
  fls_long (fls64) 會是 64 - 7 = 57 。
  1UL << 57 = 144,115,188,075,855,872
  ( 0000 0010 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 )
- rounddown_pow_of_two
  fls_long (fls64) 會是 64 - 7 = 57 。
  1UL << (57 - 1) = 72,057,594,037,927,936
  ( 0000 0001 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 )

考量

在記憶體分配的 buddy memory allcation 原理：

There are various forms of the buddy system; those in which each block is subdivided into two smaller blocks are the simplest and most common variety. Every memory block in this system has an order, where the order is an integer ranging from 0 to a specified upper limit. The size of a block of order n is proportional to
$2^{n}$ , so that the blocks are exactly twice the size of blocks that are one order lower. Power-of-two block sizes make address computation simple, because all buddies are aligned on memory address boundaries that are powers of two. When a larger block is split, it is divided into two smaller blocks, and each smaller block becomes a unique buddy to the other. A split block can only be merged with its unique buddy block, which then reforms the larger block they were split from.

可以知道記憶體會被切分成大小為
$2^{n}$ 的 block ，且是以其作單位去分配給行程等使用。
wikipedia 給了一個很好理解的例子請去查看 budd memory allocation - example 。

如果查國家教育研究院的雙語詞彙、學術名詞暨辭書資訊網也可以查到說明

伙伴系統 buddy system
2003年6月資訊與通信術語辭典
名詞解釋:
資料結構的一種儲存器管理系統。系統中儲存器的模塊總是劃分成2的冪次。如果一個模塊比存儲要求大得多，就將它分成兩個規模都是2的冪次的模塊，稱此兩模塊為〝伙伴〞。如果兩個伙伴均為可用，可再重新組合成一個較大的模塊，因為它們都是2的冪次，所以容易計算出模塊的伙伴的位址，而後確定是否可以組合成大模塊。
但不可能每個去要記憶體的行程都剛好使用
$2^{n}$ 大小，因此在 Linux 核心內部有 slab allocator 去管理剩下空閒的記憶體。
在其中，如何知曉資料身處在哪個 block 此 is_power_of_2 等函式就顯得重要許多。

3 slab allocator

slab allocator 簡介

OSTEP : Free-Space Management

Just like any good idea, this approach introduces new complications into a system as well. For example, how much memory should one dedicate to the pool of memory that serves specialized requests of a given size, as opposed to the general pool? One particular allocator, the slab allocator by uber-engineer Jeff Bonwick (which was designed for use in the Solaris kernel), handles this issue in a rather nice way [B94].

Specifically,** when the kernel boots up, it allocates a number of object caches for kernel objects that are likely to be requested frequently (such as locks, file-system inodes, etc.)**; the object caches thus are each segregated free lists of a given size and serve memory allocation and free requests quickly. When a given cache is running low on free space, it requests some slabs of memory from a more general memory allocator (the total amount requested being a multiple of the page size and the object in question).
linuxfound.org

完整請看 YouTube : SL[AUO]B: Kernel memory allocator design and philosophy
Linux 核心設計 - 記憶體管理

YouTube : Linux 核心設計：記憶體管理 (二) (2019-04-14)

註：此標示僅為簡介，若要完整知道請全部看完三部共 8 小時 46 分鐘左右。

is_power_of_2 在 Linux kernel 的使用

mm/slab_common.c






























/* Create a cache during boot when no slab services are available yet */
void __init create_boot_cache(struct kmem_cache *s, const char *name,
		unsigned int size, slab_flags_t flags,
		unsigned int useroffset, unsigned int usersize)
{
	int err;
	unsigned int align = ARCH_KMALLOC_MINALIGN;

	s->name = name;
	s->size = s->object_size = size;

	/*
	 * For power of two sizes, guarantee natural alignment for kmalloc
	 * caches, regardless of SL*B debugging options.
	 */
	if (is_power_of_2(size))
		align = max(align, size);
	s->align = calculate_alignment(flags, align, size);

	s->useroffset = useroffset;
	s->usersize = usersize;

	err = __kmem_cache_create(s, flags);

	if (err)
		panic("Creation of kmalloc slab %s size=%u failed. Reason %d\n",
					name, size, err);

	s->refcount = -1;	/* Exempt from merging for now */
}

mm/slob.c



















































/*
 * End of slob allocator proper. Begin kmem_cache_alloc and kmalloc frontend.
 */

static __always_inline void *
__do_kmalloc_node(size_t size, gfp_t gfp, int node, unsigned long caller)
{
	unsigned int *m;
	int minalign = max_t(size_t, ARCH_KMALLOC_MINALIGN, ARCH_SLAB_MINALIGN);
	void *ret;

	gfp &= gfp_allowed_mask;

	might_alloc(gfp);

	if (size < PAGE_SIZE - minalign) {
		int align = minalign;

		/*
		 * For power of two sizes, guarantee natural alignment for
		 * kmalloc()'d objects.
		 */
		if (is_power_of_2(size))
			align = max(minalign, (int) size);

		if (!size)
			return ZERO_SIZE_PTR;

		m = slob_alloc(size + minalign, gfp, align, node, minalign);

		if (!m)
			return NULL;
		*m = size;
		ret = (void *)m + minalign;

		trace_kmalloc_node(caller, ret,
				   size, size + minalign, gfp, node);
	} else {
		unsigned int order = get_order(size);

		if (likely(order))
			gfp |= __GFP_COMP;
		ret = slob_new_pages(gfp, order, node);

		trace_kmalloc_node(caller, ret,
				   size, PAGE_SIZE << order, gfp, node);
	}

	kmemleak_alloc(ret, size, 1, gfp);
	return ret;
}

在 mm/slab_common.c 第 556 行和 mm/slob.c 第 483 行說道，是為了確保在 kmalloc caches 是對齊。

create_boot_cache 在 mm/slab.c , mm/slab_common.c 和 mm/slub.c 被呼叫。
__do_kmalloc_node 則在 mm/slab.c , mm/slob.c 被呼叫。

而關於 .. guarantee natural alignment for kmalloc .. 這行註解，可以從 lwn 的一篇文章看起 Alignment guarantees for kmalloc() ：

In particular, Babka wanted to discuss when kmalloc() should return objects with their "natural alignment". What that term means was not actually defined until near the end of the session; presumably nobody will object to a slight bit of reordering at this point. Natural alignment for an object means that its beginning address is a multiple of each of up to three different quantities: ARCH_KMALLOC_MINALIGN (eight bytes, by default), the cache-line size (for larger objects), and the size of the object itself when that size is a power of two. The actual required alignment will generally be the least common multiple of the three.

可以看到對於 natural alignment 的說明，然而此篇文章主要是在探討 kmalloc 的意外狀況：

Most of the time, Babka said, kmalloc() already returns naturally aligned objects for a power-of-two allocation size; that result falls out of how the slab pages are laid out. But there are exceptions: when SLUB debugging is enabled or when the SLOB allocator is used. Few people worry much about SLOB, but the SLUB debugging exception can lead to problems for unsuspecting developers.

有了對 natural alignment 定義後，可以再看 natural alignement 的大略描述，引自 Red Hat Developer blog ：

Natural alignment describes the practice in which the address of the data type is a multiple of the size of the data type. Using natural alignment allows the processor to avoid doing multiple memory operations to access a single value. This natural alignment has a cost, and it can lead to larger data structures.

To improve the speed of memory accesses and processor performance, each memory access on modern processors deals with a larger group of bits than a single eight bit byte, typically, 64 bits (eight bytes) or 128 bits (16 bytes).

For example, a double precision floating point value is made up of eight bytes (b0 through b7) and we assume the memory width is eight bytes. If the double precision floating point number is naturally aligned, the entire value starting at address+0 can be read with a single memory access (as can be seen in the diagram below.)

However, if the floating point number is not aligned (unaligned), it straddles two words in memory and two read must be performed (one read for b0 through b3 and another read for b4 through b7). Additionally, the processor needs to paste the bits from the two reads into a single 64-bit value which also takes time.

可以看到是為了對 CPU 讀取數值時的次數等最佳化。
此篇內容也有提到如果進行了 Alignment 可能會因為原先要儲存的數值小於對齊的需求，因而有 padding 來填充使其符合大小。
可以利用 gcc 的 -Wpadded 來檢查是否有 padding 。

文中也給出了很好的例子來告訴我們再知曉其操作後如何 re-order 這些資料：

struct edge {
 int from;            \\ 4 bytes + ( padding 4 bytes )
 double distance;     \\ 8 bytes
 int to;              \\ 4 bytes + ( padding 4 bytes )
} edge;

struct edge {
 int from;            \\ 4 bytes + 4 bytes
 int to;
 double distance;     \\ 8 bytes
};

在 x86_64 下。

詳細請點此連結。

Reference

額外閱讀

I/O Performance HOWTO - Avoiding Bounce Buffers

mm/percpu.c

























































/**
 * pcpu_alloc - the percpu allocator
 * @size: size of area to allocate in bytes
 * @align: alignment of area (max PAGE_SIZE)
 * @reserved: allocate from the reserved chunk if available
 * @gfp: allocation flags
 *
 * Allocate percpu area of @size bytes aligned at @align.  If @gfp doesn't
 * contain %GFP_KERNEL, the allocation is atomic. If @gfp has __GFP_NOWARN
 * then no warning will be triggered on invalid or failed allocation
 * requests.
 *
 * RETURNS:
 * Percpu pointer to the allocated area on success, NULL on failure.
 */
static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
				 gfp_t gfp)
{
	gfp_t pcpu_gfp;
	bool is_atomic;
	bool do_warn;
	enum pcpu_chunk_type type;
	struct list_head *pcpu_slot;
	struct obj_cgroup *objcg = NULL;
	static int warn_limit = 10;
	struct pcpu_chunk *chunk, *next;
	const char *err;
	int slot, off, cpu, ret;
	unsigned long flags;
	void __percpu *ptr;
	size_t bits, bit_align;

	gfp = current_gfp_context(gfp);
	/* whitelisted flags that can be passed to the backing allocators */
	pcpu_gfp = gfp & (GFP_KERNEL | __GFP_NORETRY | __GFP_NOWARN);
	is_atomic = (gfp & GFP_KERNEL) != GFP_KERNEL;
	do_warn = !(gfp & __GFP_NOWARN);

	/*
	 * There is now a minimum allocation size of PCPU_MIN_ALLOC_SIZE,
	 * therefore alignment must be a minimum of that many bytes.
	 * An allocation may have internal fragmentation from rounding up
	 * of up to PCPU_MIN_ALLOC_SIZE - 1 bytes.
	 */
	if (unlikely(align < PCPU_MIN_ALLOC_SIZE))
		align = PCPU_MIN_ALLOC_SIZE;

	size = ALIGN(size, PCPU_MIN_ALLOC_SIZE);
	bits = size >> PCPU_MIN_ALLOC_SHIFT;
	bit_align = align >> PCPU_MIN_ALLOC_SHIFT;

	if (unlikely(!size || size > PCPU_MIN_UNIT_SIZE || align > PAGE_SIZE ||
		     !is_power_of_2(align))) {
		WARN(do_warn, "illegal size (%zu) or align (%zu) for percpu allocation\n",
		     size, align);
		return NULL;
	}

在第 1713 行，檢測要配置給 CPU 的記憶體大小是否對齊。

Reference

C99 6.5.7 Bitwise shift operators
WG14/N1256 - ISO/IEC 9899:TC3 - Footnotes 44, 45
C11 N2310 ISO/IEC 9899:20172x(E)
linuxfound - slab/slob/slub
- YouTube
Slab-allocators
Linux 核心設計 - 記憶體管理
OSTEP 17.4

測驗 `3`

1 解釋程式運作原理與重寫

解釋

void bitcpy(void *_dest,      /* Address of the buffer to write to */
            size_t _write,    /* Bit offset to start writing to */
            const void *_src, /* Address of the buffer to read from */
            size_t _read,     /* Bit offset to start reading from */
            size_t count)
    size_t read_lhs = _read & 7;
    size_t read_rhs = 8 - read_lhs;
    const uint8_t *source = (const uint8_t *) _src + (_read / 8);
    size_t write_lhs = _write & 7;
    size_t write_rhs = 8 - write_lhs;
    uint8_t *dest = (uint8_t *) _dest + (_write / 8);

_write 與 _read 是為了確保 bit 在做完複製後期結果為符合我們預期 (即想要的偏移量)。可以看到在第 4 到第 8 行對讀取資料的位置進行了修正：



























 while (count > 0) {
        uint8_t data = *source++;
        size_t bitsize = (count > 8) ? 8 : count;
        if (read_lhs > 0) {
            data <<= read_lhs;
            if (bitsize > read_rhs)
                data |= (*source >> read_rhs);
        }

        if (bitsize < 8)
            data &= read_mask[bitsize];

        uint8_t original = *dest;
        uint8_t mask = read_mask[write_lhs];
        if (bitsize > write_rhs) {
            /* Cross multiple bytes */
            *dest++ = (original & mask) | (data >> write_lhs);
            original = *dest & write_mask[bitsize - write_rhs];
            *dest = original | (data << write_rhs);
        } else {
            // Since write_lhs + bitsize is never >= 8, no out-of-bound access.
            mask |= write_mask[write_lhs + bitsize];
            *dest++ = (original & mask) | (data >> write_lhs);
        }

        count -= bitsize;
    }

在第 13 到第 24 行處理寫入的操作，以 bitsize > write_rhs 分成了跨位元組與否的複製。
寫入前對寫入目的地與 mask 的前置動作在第 13 及第 14 行。
- uint8_t original = *dest; 儲存原先資訊
- uint8_t mask = read_mask[write_lhs]; 設置 mask 為寫入的資料偏移位置做處理
多位元組複製 - 第一個位元組
- original & mask (清除要放入位置的資料)
- data >> write_lhs (調整 data 到我們要放入的位置)
- *dest++ = (original & mask) | (data >> write_lhs); 寫入
多位元組複製 - 第二個位元組
- original = *dest & write_mask[bitsize - write_rhs]; 重新利用 original 為下一個位元組要放入的資料的 mask 。
- *dest = original | (data << write_rhs); 寫入

example:
write in _ _ _ _ _ _ 1 1 , 1 1 _ _ _ _ _ _
write_rhs = 2
write_lhs = 6

改寫

#define read_mask(t) ((uint8_t)(0xFF << (8 - t)))   
#define write_mask(t) (~read_mask((t)))
        //if (bitsize < 8)
        data &= read_mask(bitsize);

        uint8_t mask = read_mask(write_lhs);
        if (bitsize > write_rhs) {
            *dest = (*dest & mask) | (data >> write_lhs);
            dest++;
            *dest = *dest & write_mask(bitsize - write_rhs);
            *dest = *dest | (data << write_rhs);
        } else {
            // Since write_lhs + bitsize is never >= 8, no out-of-bound access.
            mask |= write_mask(write_lhs + bitsize);
            *dest++ = (*dest & mask) | (data >> write_lhs);
        }
        count -= bitsize;

2 Linux 核心內的 bit 複製

lib/bitmap.c











void __bitmap_replace(unsigned long *dst,
		      const unsigned long *old, const unsigned long *new,
		      const unsigned long *mask, unsigned int nbits)
{
	unsigned int k;
	unsigned int nr = BITS_TO_LONGS(nbits);

	for (k = 0; k < nr; k++)
		dst[k] = (old[k] & ~mask[k]) | (new[k] & mask[k]);
}
EXPORT_SYMBOL(__bitmap_replace);

drivers/video/fbdev/amifb.c









	/*
	 *  Unaligned forward bit copy using 32-bit or 64-bit memory accesses
	 */

static void bitcpy(unsigned long *dst, int dst_idx, const unsigned long *src,
		   int src_idx, u32 n)
           ...

圖形處理的 Framebuffer

引述自 Stéphane Marchesin - Linux Graphics Drivers: an Introduction ：

4.2 Framebuffer operations
The framebuffer operations structure is how non-modesetting framebuffer callbacksare set. Different callbacks can be set depending on what functionality you wish to implement, like fills, copies, or cursor handling. By filling struct fb_ops callbacks, one can implement the following functions:

Copy data from area to another
void fb_copyarea(struct fb_info ∗info, const struct fb_copyarea ∗region)

而 fb_copyarea 是被包在 fb_ops 結構中，在 drivers/video/fbdev/amifb.c 初始為以下：












static const struct fb_ops amifb_ops = {
	.owner		= THIS_MODULE,
	.fb_check_var	= amifb_check_var,
	.fb_set_par	= amifb_set_par,
	.fb_setcolreg	= amifb_setcolreg,
	.fb_blank	= amifb_blank,
	.fb_pan_display	= amifb_pan_display,
	.fb_fillrect	= amifb_fillrect,
	.fb_copyarea	= amifb_copyarea,
	.fb_imageblit	= amifb_imageblit,
	.fb_ioctl	= amifb_ioctl,
};

amifb_copyarea 則一樣在 drivers/video/fbdev/amifb.c 中：





















static void amifb_copyarea(struct fb_info *info,
			   const struct fb_copyarea *area)
    ...
	if (rev_copy) {
		while (height--) {
			dst_idx -= par->next_line * 8;
			src_idx -= par->next_line * 8;
			copy_one_line_rev(info->var.bits_per_pixel,
					  par->next_plane, dst, dst_idx, src,
					  src_idx, width);
		}
	} else {
		while (height--) {
			copy_one_line(info->var.bits_per_pixel,
				      par->next_plane, dst, dst_idx, src,
				      src_idx, width);
			dst_idx += par->next_line * 8;
			src_idx += par->next_line * 8;
		}
	}
}

其中 copy_one_line 使用到 bitcpy ：
















static inline void copy_one_line(int bpp, unsigned long next_plane,
				 unsigned long *dst, int dst_idx,
				 unsigned long *src, int src_idx, u32 n)
{
	while (1) {
		dst += dst_idx >> SHIFT_PER_LONG;
		dst_idx &= (BITS_PER_LONG - 1);
		src += src_idx >> SHIFT_PER_LONG;
		src_idx &= (BITS_PER_LONG - 1);
		bitcpy(dst, dst_idx, src, src_idx, n);
		if (!--bpp)
			break;
		dst_idx += next_plane * 8;
		src_idx += next_plane * 8;
	}
}

user and kernel space

是以 byte 單位。
引自 Oracle® Linux 6Porting Guide

You can use the copy_from_user() and copy_to_user() functions to move data between kernel space and user space. Alternatively, when moving one, two, four, or eight bytes of data, you can use either put_user() and get_user() or access_ok() to validate the user-space address followed by either __put_user() or __get_user().

If user programs require direct access to device memory, you can use the mmap() system call, which maps device memory directly into user space. For example, the X server uses mmap() to write to video adapter memory and PCI devices usually memory map their control registers to improve performance. A limitation is that the area being mapped has to be a multiple of PAGE_SIZE and start at a physical memory address that is also a multiple of PAGE_SIZE.

額外

查閱資料時看到此函式，覺得有趣也上放來。
在 linux-kernel-labs.github.io 的 memory mapping 有看到

Enable the PG_reserved bit of each page with SetPageReserved(). Clear the bit with ClearPageReserved() before freeing the memory.

其中 SetPageReserved() 為 #define SetPageReserved(page) set_bit(PG_reserved, &(page)->flags) 。

而此作用在 linux-kernel-labs.github.io - memory mapping 是

Since the pages are mapped to user space, they might be swapped out. To avoid this we must set the PG_reserved bit on the page. Enabling is done using SetPageReserved() while reseting it (which must be done before freeing the memory) is done with ClearPageReserved():

因此去查了原始碼竟然是用 inline assembly 實作，是為了確保是 atomic 層級的操作

arch/alpha/include/asm/bitops.h

/*
 * These have to be done with inline assembly: that way the bit-setting
 * is guaranteed to be atomic. All bit operations return 0 if the bit
 * was cleared before the operation and != 0 if it was not.
 *
 * To get proper branch prediction for the main line, we must branch
 * forward to code at the end of this object's .text section, then
 * branch back to restart the operation.
 *
 * bit 0 is the LSB of addr; bit 64 is the LSB of (addr+1).
 */

static inline void
set_bit(unsigned long nr, volatile void * addr)
{
	unsigned long temp;
	int *m = ((int *) addr) + (nr >> 5);

	__asm__ __volatile__(
	"1:	ldl_l %0,%3\n"
	"	bis %0,%2,%0\n"
	"	stl_c %0,%1\n"
	"	beq %0,2f\n"
	".subsection 2\n"
	"2:	br 1b\n"
	".previous"
	:"=&r" (temp), "=m" (*m)
	:"Ir" (1UL << (nr & 31)), "m" (*m));
}

Reference

測驗 `4`

1 解釋程式運作原理

結構定義

// thing, will put int to the box

typedef struct __cstr_data {
    char *cstr;
    uint32_t hash_size;
    uint16_t type;
    uint16_t ref;
} * cstring;

typedef struct __cstr_buffer {
    cstring str;
} cstr_buffer[1];

// control all the thing
// __cstr_node is box
// pool is the place store those boxes
// hash => index_0 index_1, index -> linked list
// pool => all the node

struct __cstr_node {
    char buffer[CSTR_INTERNING_SIZE];
    struct __cstr_data str;
    struct __cstr_node *next;
};

struct __cstr_pool {
    struct __cstr_node node[INTERNING_POOL_SIZE];
};

struct __cstr_interning {
    int lock;
    int index;
    unsigned size;
    unsigned total;
    struct __cstr_node **hash;
    struct __cstr_pool *pool;
};

// memory level

enum {
    CSTR_PERMANENT = 1,
    CSTR_INTERNING = 2,
    CSTR_ONSTACK = 4,
};

#define CSTR_INTERNING_SIZE (32)
#define CSTR_STACK_SIZE (128)

因為函式個別解釋會太過瑣碎，因此我以 test_cstr 為起始講解：

static void test_cstr()
{
    CSTR_BUFFER(a);
    cstr_cat(a, "Hello ");
    cstr_cat(a, "string");
    cstring b = cmp(CSTR_S(a));
    printf("%s\n", b->cstr);
    CSTR_CLOSE(a);
    cstr_release(b);
}

CSTR_BUFFER

以 macro 宣告一個以 cstr_buffer 為主的物件。
其中包含了 CSTR_STACK_SIZE 大小為 128 Bytes 儲存在 stack 的小字串。
__cstr_data 結構 ( aka cstring )，用來儲存資訊，是 cstr_buffer 唯一指向的物件。

#define CSTR_BUFFER(var)                                                      \
    char var##_cstring[CSTR_STACK_SIZE] = {0};                                \
    struct __cstr_data var##_cstr_data = {var##_cstring, 0, CSTR_ONSTACK, 0}; \
    cstr_buffer var;                                                          \
    var->str = &var##_cstr_data;

cstr_cat

逐字元增加字串，若 type 非 CSTR_ONSTACK 或是但超出 CSTR_STACK_SIZE 則進入 cstr_cat2 。
cstr_cat2 則會開始處理到中字串與大字串的範疇。
if (*str == 0) 為判斷加上的字串是否在 CSTR_STACK_SIZE 層級內加完。
s->cstr[i] = 0; 則是為了確保不要在進入 cstr_cat2 使接下來讀取字串產生問題。

cstring cstr_cat(cstr_buffer sb, const char *str)
{
    cstring s = sb->str;
    if (s->type == CSTR_ONSTACK) {
        int i = s->hash_size;
        while (i < CSTR_STACK_SIZE - 1) {
            s->cstr[i] = *str;
            if (*str == 0)
                return s;
            ++s->hash_size;
            ++str;
            ++i;
        }
        s->cstr[i] = 0;
    }
    cstring tmp = s;
    sb->str = cstr_cat2(tmp->cstr, str);
    cstr_release(tmp);
    return sb->str;
}

cstr_cat2

if (sa + sb < CSTR_INTERNING_SIZE) 是在判斷是否是在 CSTR_INTERNING 層級。
若是，則會在處理完字串後，進入 cstr_interning 內。
當不是時則進行 xalloc ，把 type 設 0 、ref 設 1 、hash_size 設 0 回傳到原先的 sb->str 。

static cstring cstr_cat2(const char *a, const char *b)
{
    size_t sa = strlen(a), sb = strlen(b);
    if (sa + sb < CSTR_INTERNING_SIZE) {
        char tmp[CSTR_INTERNING_SIZE];
        memcpy(tmp, a, sa);
        memcpy(tmp + sa, b, sb);
        tmp[sa + sb] = 0;
        return cstr_interning(tmp, sa + sb, hash_blob(tmp, sa + sb));
    }
    cstring p = xalloc(sizeof(struct __cstr_data) + sa + sb + 1);
    if (!p)
        return NULL;

    char *ptr = (char *) (p + 1);
    p->cstr = ptr;
    p->type = 0;
    p->ref = 1;
    memcpy(ptr, a, sa);
    memcpy(ptr + sa, b, sb);
    ptr[sa + sb] = 0;
    p->hash_size = 0;
    return p;
}

hash_blob

此函式為計算字串在 hash table 中的值。

static inline uint32_t hash_blob(const char *buffer, size_t len)
{
    const uint8_t *ptr = (const uint8_t *) buffer;
    size_t h = len;
    size_t step = (len >> 5) + 1;
    for (size_t i = len; i >= step; i -= step)
        h = h ^ ((h << 5) + (h >> 2) + ptr[i - 1]);
    return h == 0 ? 1 : h;
}

cstr_interning

找出在 __cstr_ctx 裡儲存的 hash 中符合 cstr 的字串。
現存總數 total 會加一。

static cstring cstr_interning(const char *cstr, size_t sz, uint32_t hash)
{
    cstring ret;
    CSTR_LOCK();
    ret = interning(&__cstr_ctx, cstr, sz, hash);
    if (!ret) {
        expand(&__cstr_ctx);
        ret = interning(&__cstr_ctx, cstr, sz, hash);
    }
    ++__cstr_ctx.total;
    CSTR_UNLOCK();
    return ret;
}

interning

在 struct __cstr_interning *si 中的 hash 尋找符合的字串。
以剛剛計算在 hash_blob 的 uint32_t hash 值除以 si->size - 1 的餘數為 index ，hash[index] 中開始尋找。
- 之所以能 hash & (si->size - 1) ，是因為有確保每次更新 si->size 都為二的倍數。
如果都沒找到則會新增，但當 hash 所擁有的紀錄達到其臨界值 (4/5 == total / size ) 會停止操作。
若否，則在 pool 新增新的 node 並在 hash[index] 標記。

疑慮，是否有確保 n = &si->pool->node[si->index++]; 在大於 INTERNING_POOL_SIZE 時能正常運作。
不論尋找到或是直接新增，回傳值均為 cstring 型態。

static cstring interning(struct __cstr_interning *si,
                         const char *cstr,
                         size_t sz,
                         uint32_t hash)
{
    if (!si->hash)
        return NULL;

    int index = (int) (hash & (si->size - 1));
    struct __cstr_node *n = si->hash[index];
    while (n) {
        if (n->str.hash_size == hash) {
            if (!strcmp(n->str.cstr, cstr))
                return &n->str;
        }
        n = n->next;
    }
    // 80% (4/5) threshold
    if (si->total * 5 >= si->size * 4)
        return NULL;
    if (!si->pool) {
        si->pool = xalloc(sizeof(struct __cstr_pool));
        si->index = 0;
    }
    n = &si->pool->node[si->index++];
    memcpy(n->buffer, cstr, sz);
    n->buffer[sz] = 0;

    cstring cs = &n->str;
    cs->cstr = n->buffer;
    cs->hash_size = hash;
    cs->type = CSTR_INTERNING;
    cs->ref = 0;

    n->next = si->hash[index];
    si->hash[index] = n;

    return cs;
}

expand

當 hash 的大小不夠或初次呼叫時，會呼叫此函式。
初次呼叫 hash 大小會直接是 HASH_START_SIZE 16。
unsigned new_size = si->size * 2; ，此操作能確保 size 為 2 的次方。
會以 insert_node 一一複製原先的 hash所擁有的資料。

static void expand(struct __cstr_interning *si)
{
    unsigned new_size = si->size * 2;
    if (new_size < HASH_START_SIZE)
        new_size = HASH_START_SIZE;

    struct __cstr_node **new_hash =
        xalloc(sizeof(struct __cstr_node *) * new_size);
    memset(new_hash, 0, sizeof(struct __cstr_node *) * new_size);

    for (unsigned i = 0; i < si->size; ++i) {
        struct __cstr_node *node = si->hash[i];
        while (node) {
            struct __cstr_node *tmp = node->next;
            insert_node(new_hash, new_size, node);
            node = tmp;
        }
    }

    free(si->hash);
    si->hash = new_hash;
    si->size = new_size;
}

cstr_release

釋放的條件是 type 不為零、 ref 為 0 。
5.44 Built-in functions for atomic memory access - __sync_sub_and_fetch

void cstr_release(cstring s)
{
    if (s->type || !s->ref)
        return;
    if (__sync_sub_and_fetch(&s->ref, 1) == 0)
        free(s);
}

至此，cstr_cat 的操作才算完成。

//TODO release close

接下來的函式是在針對單一字串進行讀取使所用的函式與巨集。

cstr_grab

可以看到在對不同的 type 有不同的對應。
當是 CSTR_PERMANENT 或 CSTR_INTERNING 會直接回傳。
但如果是 CSTR_ONSTACK 會進行 cstr_clone 。
- cstr_clone 則是在 s->hash_size 小於 CSTR_INTERNING_SIZE時，把 cstring s 記入到 hash 裡。
- 如超出大小則複製，並回傳副本。

cstring cstr_grab(cstring s)
{
    if (s->type & (CSTR_PERMANENT | CSTR_INTERNING))
        return s;
    if (s->type == CSTR_ONSTACK)
        return cstr_clone(s->cstr, s->hash_size);
    if (s->ref == 0)
        s->type = CSTR_PERMANENT;
    else
        __sync_add_and_fetch(&s->ref, 1);
    return s;
}

cstring cstr_clone(const char *cstr, size_t sz)
{
    if (sz < CSTR_INTERNING_SIZE)
        return cstr_interning(cstr, sz, hash_blob(cstr, sz));
    cstring p = xalloc(sizeof(struct __cstr_data) + sz + 1);
    if (!p)
        return NULL;
    void *ptr = (void *) (p + 1);
    p->cstr = ptr;
    p->type = 0;
    p->ref = 1;
    memcpy(ptr, cstr, sz);
    ((char *) ptr)[sz] = 0;
    p->hash_size = 0;
    return p;
}

uthash

關於 hash ，之前在看論壇時有人有推薦可以用 uthash 這個東西。
此並非是 C 的函式庫，只是個標頭檔因此使用起來很方便。

Add, find and delete are normally constant-time operations. This is influenced by your key domain and the hash function.
原始碼

蠻妙的是它是系列相關的專案，包含了 utstring 。

2 string interning in 多執行緒

`atomic` and `__sync`

atomic and __sync
- __sync 的實作在某些硬體上根本沒有，官方也推薦新的程式碼該用別的
- gcc - 6.54 Legacy __sync Built-in Functions for Atomic Memory Access
  
  These functions are implemented in terms of the ‘__atomic’ builtins (see __atomic Builtins). They should not be used for new code which should use the ‘__atomic’ builtins instead.
- gcc.gnu.org wiki atomic GCCMM

`cstr_cat` - 字串完整性

第一次測試程式碼，完整程式碼請點看 GitHub 。

static void test_cstr(void *buf) {
  cstr_buffer *a = (cstr_buffer *)buf;
  cstr_cat((*a), "a");
  cstr_cat((*a), "a");
}

static void test0(void) {
#ifdef times
#undef times
#endif
#define times 1000000
  for (int i = 0; i < times; i++) {
    CSTR_BUFFER(a);
    //CSTR_BUFFER(b);

    pthread_t thread_0, thread_1;
    pthread_create(&thread_1, NULL, (void *)test_cstr, (void *)a);
    pthread_create(&thread_0, NULL, (void *)test_cstr, (void *)a);
    pthread_join(thread_0, NULL);
    pthread_join(thread_1, NULL);

    CSTR_LITERAL(hello, "aaaa");
    if (!cstr_equal(hello, CSTR_S(a))) {
      printf("not equal : ");
      printf("%s\n", CSTR_S(a)->cstr);
    }
    CSTR_CLOSE(a);
    //CSTR_CLOSE(b);
  }
#undef times 
}

會出現：













$ gcc -o thread0 cstr.c thread.c -g -lpthread
$ ./thread0
not equal : a
not equal : aa
not equal : aa
not equal : a
not equal : a
not equal : a
not equal : aa
not equal : aa
not equal : aa
not equal : a
not equal : aa

之後做了修改，雖然是可以正常運行了，但 LOCK 的範圍太大，效率不高。

原先只有在 cstr_interning 函式亦即對 hash 操作時作保護，但對於字串操控還有 cstr_cat 這個函式。

cstring cstr_cat(cstr_buffer sb, const char *str)
{
    cstring s = sb->str;
    CSTR_LOCK();
    if (s->type == CSTR_ONSTACK) {
        int i = s->hash_size;
        while (i < CSTR_STACK_SIZE - 1) {
            s->cstr[i] = *str;
            if (*str == 0) {
                CSTR_UNLOCK();
                return s;
            }
            ++s->hash_size;
            ++str;
            ++i;
        }
        s->cstr[i] = 0;
    }
    cstring tmp = s;
    CSTR_UNLOCK();
    sb->str = cstr_cat2(tmp->cstr, str);
    cstr_release(tmp);
    return sb->str;
}

但是上述 test_cstr 沒辦法檢測對 string interning 的操作，因此我改成如下：

static void test_cstr(void *buf) {
  cstr_buffer *a = (cstr_buffer *)buf;
  cstr_cat((*a), "aaaa");
  (*a)->str = cstr_grab(CSTR_S(*a));
  cstr_cat((*a), "aaaa");
}

字串變成需要 16 個 a ，每次增加 4 個 a 。
當處理兩個不同字串時是可以的。
但仍就沒有完全成功。在處理同字串會失敗，因為函式間沒法保證讀取時會是預期狀況。
在 10000 次測試下，會有 4951 次會是失敗的，跟擲骰子差不多完全不可靠。

最後解決方案，因為考量到沒辦法在處理字串時每個步驟都 lock ，因此直接在另設個 lock 在呼叫時包起來。

但這有點治標不致本，會使用者須額外負擔分辨是否操作的視同一個字串，並且還要自己設字串的 lock 。

volatile atomic_flag alock = ATOMIC_FLAG_INIT;

#define a_lock()\
({while (atomic_flag_test_and_set(&alock));})

#define a_unlock()\
({atomic_flag_clear(&(alock));})

#define cstr_cat_s(s, a) do{\
a_lock();\
cstr_cat((s), a);\
a_unlock();} while (0)

#define cstr_grab_s(from, to) do{\
a_lock();\
(to) = cstr_grab(from);\
a_unlock();} while (0)

cstr_cat_s((*a), "aaaa");
cstr_grab_s(CSTR_S(*a), (*a)->str);
cstr_cat_s((*a), "aaaa");



gcc -o thread_lock cstr_t.c thread.c -g -lpthread
$ for i in {1..10}; do ./thread_lock;  done
$

ref - 操作的保證

目標為在多執行緒下，能夠掌控 ref 的數值大小。
因此以 4 到 10 個執行緒測試了以 cstr_grab 和 cstr_release 對 ref 進行加減的實驗。
在重複執行100次時看似沒問題，但拉高到1000次就可以看到有時候會出錯如：

ref: 3, type 0
ref: 0, type 32750

type 會是 32750 是因為保存其的 cstring s 已被釋放掉。

被釋放掉則是因為 ref 在增加之前已被減為零。

static void test1(void) {
#define times 1
#define thread_n 4
  for (int i = 0; i < times; i++) {
    CSTR_BUFFER(a);
        cstr_cat_s(a, "CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC");
    pthread_t thread[thread_n];
    int j;
    for (j = 0; j < 2; j++)
      pthread_create(&thread[j], NULL, (void *)test_ref_create_ns, (void *)a);
    for (;j < 4;j++)
      pthread_create(&thread[j], NULL, (void *)test_ref_del_ns, (void *)a);
    for (int j = 0; j < thread_n; j++)
      pthread_join(thread[j], NULL);
    printf("ref: %d, type %d\n", a->str->ref, a->str->type);
    CSTR_CLOSE(a);
  }
#undef times
}

static void test_ref_create_ns(void *buf) {
  cstr_buffer *a = (cstr_buffer *)buf;
  (*a)->str = cstr_grab(CSTR_S(*a)); //add
  (*a)->str = cstr_grab(CSTR_S(*a)); //add
}

static void test_ref_del_ns(void *buf) {
  cstr_buffer *a = (cstr_buffer *)buf;
  cstr_release((*a)->str); //sub
}

因此就如 cat 一樣方式處理，但須加上 sleep() ：

關於老師所說

這是什麼可怕的解法？

是可以運用 reader and writer 的方法來處理。
設一個減值的 lock 擋住 release 但不擋住增值操作，之後再設一個 semaphore 來記錄目前增值有幾個。當那個 semaphore 為零代表增值操作已經完成， lock 才解除讓減值操作。

但當時候只為了區分兩種操作，為了讓實驗方便操作所以才直接設 sleep() 。
```
static void test_ref_create(void *buf) {
  cstr_buffer *a = (cstr_buffer *)buf;
  cstr_grab_s(CSTR_S(*a), (*a)->str);
  cstr_grab_s(CSTR_S(*a), (*a)->str);

}

static void test_ref_del(void *buf) {
  cstr_buffer *a = (cstr_buffer *)buf;
  cstr_release_s((*a)->str);
}
```
之後試了幾種組合，如果說要對 ref 在個執行緒保持一致性對函式 lock 即可。
但如果是要以要保證最後的 ref 值為何則須以分開增減值操作。
- detail
然而上述所的不包含 hash 存放的值，因為其以保持在 ref 為零的狀態且只可讀取。

總結

雖然有對共享值作如 __sync_sub_and_fetch 、 __sync_add_and_fetch 等保護，但在讀取數值與執行順序上卻沒有保證，關於這概念可以看並行程式設計: 執行順序 - Sequential Consistency。
簡單來說，存到 cache 要作增減的值必須要保持在最新的狀態，而當其他執行緒要去存取時也要能夠看到前者的更新資訊。

Reference

3 chriso/intern 探討與分析

Jim Huang

2021/03/08 08:24:50

利用 `#include <sys/resource.h>` 裡的 `struct rusage` 來記錄每 1ms 記憶體的使用量

為何不用 valgrind --tool=massif 呢？ (Edited)

linD026

2021/03/08 08:29:08

這幾天都在查怎麼測量記憶體，，一直找不到如何測量stack的方法。 (Edited)

2021/03/08 08:29:47

只查到 valgrind 可以檢測記憶體缺失相關的功能，沒想到有這個XD (Edited)

2021/03/08 08:41:49

我是有找到 callgrind 但需要額外下載 GUI 才比較方便就沒採用了 (Edited)

2021/03/08 10:27:49

除了 n = 1 比較高外

或許可對 list-sort.c 提交程式碼貢獻，提出 n = 1 的特別處理 (fast path) (Edited)

2021/03/13 08:04:44

目錄的英語是 "directory"，檔案夾的英語是 "folder"，不要混用！ (Edited)

2021/03/14 01:54:49

查看 alpha 會是

舉例解說的微處理器架構用 x86, x86_64, arm, aarch64, risc-v 等架構為主，你可以忽略 Alpha (Edited)

2021/03/21 20:51:30

整程式碼請點看 [GitHub](https://github.com/linD026/linux2021_homework_quiz2_string_interning)

請提供 Makefile (Edited)

2021/03/21 21:42:07

須加上 `sleep()

這是什麼可怕的解法？ (Edited)

Yiwei Lin

2021/03/22 16:35:58

當處理兩個不同字串時是可以的

同學好，有些疑惑之處不曉得能不能請教。想請問不同字串可以得到預期結果而相同則無法的原因，或者不同字串的實驗具體是怎麼設計的呢? 謝謝 (Edited)

2021q1 Homework2 (quiz2)

tags: linux2021

0 進度

測驗 1

1 解釋程式運作原理 與 可改進

結構定義

函式說明

1. get_middle

2. list_merge

3. list_merge_sort

改良結構與連結方式

4. q_new

5. q_free

6. q_insert_head

2 Linux 核心 - lib/list_sort.c

實作可單獨執行 (standalone) 的使用層級應用程式

list_sort.c

list.h

設計檢測

說明 lib/list_sort.c 最佳化手法

list_sort( )

3 效能比較

n = 1 stack 使用量大問題

測驗 2

1 解釋程式運作原理

C11 6.3 Conversions

2 is_power_of_2

說明 is_power_of_2 原理

考量

3 slab allocator

slab allocator 簡介

is_power_of_2 在 Linux kernel 的使用

測驗 3

1 解釋程式運作原理 與 重寫

解釋

改寫

2 Linux 核心內的 bit 複製

圖形處理的 Framebuffer

user and kernel space

額外

測驗 4

1 解釋程式運作原理

結構定義

uthash

2 string interning in 多執行緒

atomic and __sync

cstr_cat - 字串完整性

ref - 操作的保證

總結

3 chriso/intern 探討與分析

Read more

Linux 核心採納 Rust 的狀況

[Linux 核心 Copy On Write 實作機制](https://hackmd.io/@linD026/Linux-kernel-COW-content)

Linux 核心 Read-Copy Update 筆記整理

[Linux 核心 Copy On Write - Memory Region](https://hackmd.io/@linD026/Linux-kernel-COW-content)

tags: `linux2021`

測驗 `1`

1 解釋程式運作原理與可改進

測驗 `2`

測驗 `3`

1 解釋程式運作原理與重寫

測驗 `4`

`atomic` and `__sync`

`cstr_cat` - 字串完整性