contributed by < Kuanch >
Lccgth
佇列實作部分應只貼出關鍵程式碼,例如 q_merge 中可省略變數宣告的程式碼,以文字敘述。
for (struct list_head *p = head->next->next; p != head; p = p->next) {
queue_contex_t *node = list_entry(p, queue_contex_t, chain);
list_splice_init(node->q, merged_list->q);
}
q_sort(merged_list->q, descend);
在佇列實作部分可列出相應的 git commit 紀錄,方便日後優化時進行比較。
在 q_sort 與 list_sort 效能比較中,將 list_sort 的 likely 和 unlikely 巨集進行更動,會影響編譯時的最佳化,可以參考 gcc 13.5 第 6.59 章中的 __builtin_expect。
Kuanch
lintin528
github 上佇列實作分為多次 commit 在維護的時候會較方便。
$ uname -a
Linux 5.15.0-100-generic #110~20.04.1-Ubuntu SMP
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 39 bits physical, 48 bits virtual
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 140
Model name: 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz
Stepping: 1
CPU MHz: 776.504
CPU max MHz: 4200.0000
CPU min MHz: 400.0000
BogoMIPS: 4838.40
Virtualization: VT-x
L1d cache: 192 KiB
L1i cache: 128 KiB
L2 cache: 5 MiB
L3 cache: 8 MiB
NUMA node0 CPU(s): 0-7
Vulnerability Gather data sampling: Mitigation; Microcode
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
$ gcc --version
gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ valgrind --version
valgrind-3.15.0
$ cppcheck --version
Cppcheck 1.90
你的洞見呢?
q_sort
程式碼列出是為了討論和檢討,倘若你沒有進行討論,就不要列出。讀者要看你的洞見。
Kuanch 依照老師和同學建議減少張貼的程式碼
q_sort
要求 In-place 操作,故使用 LIST_HEAD(new_head);
為實作重點;對比於 malloc
的動態分配(Dynamic Allocation),此處為靜態分配(Static Allocation),後者有幾項特點
無論標題和內文中,中文和英文字元之間要有空白字元 (對排版和文字搜尋有利)。
留意各式細節,唯有重視小處並步步為營,方可挑戰原始程式碼規模超越三千七百萬行的 Linux 核心。
q_merge
本例中將前述常用的結構 element_t
改為 queue_context_t
,每一個包含 list_head
的 queue_context_t
帶有一個佇列開頭 node->q
,透過 list_splice_init
合併到第一個 queue_context_t
當中的佇列,之後便可進行排序。
我認為 queue_context_t
特別之處在於每一個元素都能夠再帶有一個佇列,所以也許可以考慮以下這種應用
#include <stdio.h>
#include <stdlib.h>
struct list_head {
struct list_head *next, *prev;
};
typedef struct {
struct list_head list;
int id;
int runtime;
int priority;
double energy_cost;
// other properties of tasks
// ...
} task_t;
typedef struct {
struct list_head q; // Head of the queue of tasks
struct list_head *other_task_queue;
int size; // Number of tasks in the queue
int id; // Identifier for the queue
} queue_context_t;
void link_task_queue(queue_context_t queue1, queue_context_t queue2) {
queue1.other_task_queue = &queue2.q;
queue2.other_task_queue = &queue1.q;
}
int main() {
queue_context_t queue_task_type1;
queue_context_t queue_task_type2;
// initialize task queue, enqueue, schedule etc.
// ...
// link task queue
link_task_queue(queue_task_type1, queue_task_type2);
return 0;
}
假設 queue_task_type1
和 queue_task_type2
為性質相似的任務佇列,在內可以儲存各自的任務,亦可走訪有相似性質的其他任務佇列。
container_of
後來發現 Jserv 有更詳細的專文討論,請見Linux 核心原始程式碼巨集: container_of
本來就是指定教材,為何你要捨近求遠呢?
jserv
引用
請不要小看這巨集,畢竟大量在 Linux 核心原始程式碼採用的巨集,應有其獨到之處。在 container_of 巨集出現前,程式設計的思維往往是:
- 給定結構體起始地址
- 求出結構體特定成員的記憶體內容
- 傳回結構體成員的地址,作日後存取使用 …
理解為 ->
有了 container_of
後,可以依結構體成員的地址反向查詢結構體起始地址,無須儲存結構體起始地址、無須存取結構體成員的地址
另外是否也有功能是,可以作為統一的鏈結串列的介面,對於不同的應用,只需要設計不同的 element_t
並加入 struct list_head *head
,便可使用 Linux Kernel 的固有函式走訪成員?
container_of
是 linux kernel 實作佇列結構的重要函式,我認為有必要分析並了解其如何運作
#ifndef container_of
#ifdef __LIST_HAVE_TYPEOF
#define container_of(ptr, type, member) \
__extension__({ \
const __typeof__(((type *) 0)->member) *__pmember = (ptr); \
(type *) ((char *) __pmember - offsetof(type, member)); \
})
#else
#define container_of(ptr, type, member) \
((type *) ((char *) (ptr) - offsetof(type, member)))
#endif
#endif
讓我們觀察下方未定義 __LIST_HAVE_TYPEOF
的版本開始;當我們已知 list_head
的記憶體位置 ptr
且已知結構大小,便可反向推算 element_t
之記憶體位置;考慮以下程式碼
#include <stddef.h> // For offsetof
#include <stdio.h>
struct list_head {
struct list_head *next;
struct list_head *prev;
};
struct element_t {
int data;
struct list_head list;
};
int main() {
struct element_t element;
struct list_head *list_ptr = &element.list;
printf("size of element_t: %lu\n", sizeof(struct element_t)); // 24
printf("Original list_head address: %p\n",
(void *)list_ptr); // 0x7ffe406db2a8
printf("the offset of list in element_t: %lu\n", // 8
offsetof(struct element_t, list));
// Calculate the start of element_t from list_ptr
struct element_t *element_ptr =
(struct element_t *)((char *)list_ptr - offsetof(struct element_t, list));
printf("Original element_t address: %p\n",
(void *)&element); // 0x7ffe406db2a0
printf(
"Calculated element_t address using container_of: %p\n", // 0x7ffe406db2a0
(void *)element_ptr);
return 0;
}
有幾處需要說明
offsetof(struct element_t, list)
應該為 4,因為 element_t
的位址應該是 list
的位址 - sizeof(data)
,但此處卻為 8?
Ans: 因為 data alignment,參考你所不知道的 C 語言:記憶體管理、對齊及硬體特性
如何驗證?使用 __attribute__
修改 element_t
struct __attribute__((packed)) element_t {
int data;
struct list_head list;
};
// After introducing __attribute__
// offsetof(struct element_t, list) == 4
// sizeof(struct element_t) == 20
(char *)
且以 (void *)
取得?
Ans: 是兩件事情。轉型至 (char *)
係因其為 1 byte,能夠在 byte 之上操作。 (void *)
稱做通用指標,此處僅為與 %p
互動輸出正確結果。查閱 C 語言規格書,看是否存在「通用」指標這說法。
Kuanch 沒有! 查閱 ISO/IEC 9899:2018 不存在 "generic pointer" (我自作主張翻譯作通用指標) 一詞 引用 ISO/IEC 9899:2018
A pointer to void may be converted to or from a pointer to any object type. A pointer to any object type may be converted to a pointer to void and back again; the result shall compare equal to the original pointer.
更多的是稱作 "a pointer to void",文中共提到 10 次。 搜尋引擎的結果,包含它本身都錯把 generic pointer 與
(void *)
畫上等號。
lib/list_sort.c
lib/list_sort.c
原先想直接與 linux/lib/list_sort.c 編譯,但屢試無果,故參考 komark06 的方式移植到 queue.c
2023q1 Homework1 (lab0)。
「本該如此」的事,就不要寫。
merge
-__attribute__((nonnull(2,3,4)))
-static struct list_head *merge(void *priv, list_cmp_func_t cmp,
- struct list_head *a, struct list_head *b)
+__attribute__((nonnull)) struct list_head *merge(struct list_head *a,
+ struct list_head *b)
merge_final
- if (unlikely(!++count))
- cmp(priv, b, b);
+ // if (!++count)
+ // strcmp(list_entry(b, element_t, list)->value,
+ // list_entry(b, element_t, list)->value);
list_sort
- if (likely(bits)) {
- struct list_head *a = *tail, *b = a->prev;
+ if (bits) {
+ struct list_head *a = *tail, *b = a->prev;
使用 perf
進行分析,測試程式如下
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include "queue.h"
int main() {
struct list_head *q = q_new();
int list_size = 100000;
for (int i = 0; i < list_size; i++) {
char s = (char) (rand() % 26 + 'a');
q_insert_tail(q, &s);
}
// q_sort(q, true);
list_sort(q);
q_free(q);
}
以下使用 perf state -r 5
perf stat -r 5
,可以參考 Linux 效能分析工具: Perf
Performance counter stats for './q_sort' (5 runs):
7,2825.90 msec task-clock # 1.000 CPUs utilized ( +- 0.28% )
362 context-switches # 4.968 /sec ( +- 6.35% )
8 cpu-migrations # 0.115 /sec ( +- 22.14% )
3582 page-faults # 49.186 /sec ( +- 0.01% )
2957,7153,0020 cycles # 4.061 GHz ( +- 0.22% )
625,4618,0370 instructions # 0.21 insn per cycle ( +- 0.00% )
208,1336,9257 branches # 285.796 M/sec ( +- 0.00% )
124,9445 branch-misses # 0.01% of all branches ( +- 0.59% )
1,4784,0455,5899 slots # 20.301 G/sec ( +- 0.22% )
285,2832,2723 topdown-retiring # 1.9% retiring ( +- 1.54% )
846,9593,7809 topdown-bad-spec # 5.7% bad speculation ( +- 8.49% )
69,6204,8854 topdown-fe-bound # 0.5% frontend bound ( +- 2.68% )
1,3755,0196,3845 topdown-be-bound # 92.0% backend bound ( +- 0.25% )
72.831 +- 0.206 seconds time elapsed ( +- 0.28% )
Performance counter stats for './list_sort' (5 runs):
7,3189.10 msec task-clock # 1.000 CPUs utilized ( +- 0.17% )
384 context-switches # 5.252 /sec ( +- 5.30% )
12 cpu-migrations # 0.158 /sec ( +- 34.55% )
3582 page-faults # 48.942 /sec ( +- 0.01% )
2972,8367,5031 cycles # 4.062 GHz ( +- 0.18% )
626,1171,8129 instructions # 0.21 insn per cycle ( +- 0.00% )
208,3703,4674 branches # 284.701 M/sec ( +- 0.00% )
126,7828 branch-misses # 0.01% of all branches ( +- 1.30% )
1,4852,8263,7616 slots # 20.294 G/sec ( +- 0.17% )
282,5978,1184 topdown-retiring # 1.9% retiring ( +- 1.87% )
908,7784,9886 topdown-bad-spec # 6.0% bad speculation ( +- 6.01% )
63,9197,5271 topdown-fe-bound # 0.4% frontend bound ( +- 4.39% )
1,3783,7450,0815 topdown-be-bound # 91.7% backend bound ( +- 0.33% )
73.193 +- 0.123 seconds time elapsed ( +- 0.17% )
整理重要項為表格 :
my sort | list sort | |
---|---|---|
task clock | 72825.9 msec | 73189.1 msec |
context switches | 362 | 384 |
page faults | 3582 | 3582 |
branches misses | 124,9445 | 126,7828 |
在作業說明中提到
用 bottom up 實作 merge sort 對 cache 較友善,因為過程中就是一直合併,cache 被參照到的機會更大。 而 top down 是會先做 partition 再來 merge,但 partition 本身對 cache 不友善,在 cache 移進移出(內容不斷更新,導致 cache thrashing。
以及 改進 lib/list_sort.c 中提到
lib/list_sort.c 中實作的 merge sort 演算法採用 bottom-up 策略。它首先將給定的串列轉換為一個以空結尾的單向鏈結串列。該演算法維護一個稱為 "pending" 的數據結構,表示等待進一步合併的已排序子串列。
雖然可以見到幾乎無差距,但原先設想是,我的實作方法是遞迴,每次都會切出一半的資料進行兩次遞迴(即我理解的 top-down 而非 bottom-up),理論上應該會導致 cache thrashing、有大量的 page-faults?
此處效能測量可能有誤,待釐清。
3 / 7 更新 可能被編譯器最佳化了?除了調整最佳化等級,可以用 perf + 觀察 asm 看是否有顯著差距
以下實驗針對幾點做改進
perf
試驗數我們依然分別隨機產生資料,但本次係從 10 個資料點,每次增加 10 個資料並進行排序,直到 20000 個資料點,紀錄 CPU cycles 後繪製成下圖
可以見到我所實作的 q_sort
隨著資料增加,CPU cycles 有指數增加的趨勢,相對於 list sort 接近於線性的增加,在大量資料的狀況下已有明顯差距,以下為測試程式碼
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include <stdint.h>
#include <x86intrin.h>
#include "queue.h"
#include "dudect/cpucycles.h"
void static inline q_sort_wrapper(struct list_head *head) {
bool predefined_descend = true;
q_sort(head, predefined_descend);
}
int main(int argc, char** argv) {
if (argc != 2) {
printf("Usage: %s <sort_name>\n", argv[0]);
return 1;
}
char* sort_name = argv[1];
char* save_file;
void (*sort_func)(struct list_head *head);
if (strcmp(sort_name, "qsort") == 0) {
sort_func = &q_sort_wrapper;
save_file = "qsort_perf.txt";
}
else if (strcmp(sort_name, "listsort") == 0) {
sort_func = list_sort;
save_file = "listsort_perf.txt";
}
else {
printf("Invalid sort name: %s\n", sort_name);
return 1;
}
struct list_head *q = q_new();
FILE *fileout = fopen(save_file, "w");
for (int num_data = 10; num_data <= 20000; num_data += 10) {
q = q_new();
for (int i = 0; i < num_data; i++) {
char s = (char) (rand() % 26 + 'a');
q_insert_tail(q, &s);
}
_mm_mfence();
int64_t start = cpucycles();
sort_func(q);
int64_t end = cpucycles();
// save the result to text for analysis later
fprintf(fileout, "%d %ld\n", num_data, end - start);
}
}
多少樣本數才足以取得具有代表性的統計量?
問題的答案因母體分配而異,若假設母體為常態分佈,則根據 《Educational research competencies for analysis and application》 (Gay, L. R., 1992),至少需有 30 個樣本數,小於 30 則影響大多數檢定之檢定力,如 z-test;又根據 〈A Practical Guide for Using Statistical Tests to Assess Randomized Algorithms in Software Engineering〉(Andrea Arcuri, 2011),至少 30 次試驗是一個普遍適用的標準,但對於部分檢定,如 U test,100 次仍不足,而 1000 次則是推薦的次數。
以隨機產生的資料,每次有 20000 筆資料,進行 1000 次排序,刪除了大於第 99 分位數 (
作圖後可以顯著地看到兩者分布的有顯著差異,我們可以判斷兩者母體服從於不同分布;而就視覺上看,其母體可能非常態分布,若需要執行統計檢定需要注意,可以透過 K-S 檢定判斷是否服從常態分布。
perf
試驗數和不同樣本特性本次除了增加 perf
的試驗數外,每一次試驗都將進行一千次的排序,即
int num_test = 1000;
for (int test = 0; test < num_test; test++) {
q = q_new();
data_prod(q);
// print_list(q);
_mm_mfence();
int64_t start = cpucycles();
sort_func(q);
int64_t end = cpucycles();
// printf("Cycles: %ld\n", end - start);
// print_list(q);
fprintf(fileout, "%ld ", end - start);
}
猜測 list_sort
可能在特定場景能發揮更好的效能,故我們準備 2 種資料,依據這 2 種資料,使用 perf stat -r 1000
得到以下四種結果
我們以以下程式碼產生資料
void static inline almost_sorted_data(struct list_head *head) {
for (int i = 0; i < 10000; i++) {
char s = (char) (rand() % 26 + 'a');
q_insert_tail(head, &s);
}
q_sort(head, false);
for (int i = 0; i < 100; i++) {
char s = (char) (rand() % 26 + 'a');
q_insert_tail(head, &s);
}
}
Performance counter stats for './sort_dist qsort almost_sort' (1000 runs):
3137.17 msec task-clock # 0.998 CPUs utilized ( +- 0.01% )
8 context-switches # 2.544 /sec ( +- 1.39% )
0 cpu-migrations # 0.000 /sec
35,5163 page-faults # 112.949 K/sec ( +- 0.00% )
131,3528,1701 cycles # 4.177 GHz ( +- 0.01% )
206,3582,6760 instructions # 1.57 insn per cycle ( +- 0.00% )
45,1489,8824 branches # 1.436 G/sec ( +- 0.00% )
1,0356,1613 branch-misses # 2.29% of all branches ( +- 0.01% )
656,3148,1975 slots # 20.872 G/sec ( +- 0.01% )
203,3289,0494 topdown-retiring # 30.9% retiring ( +- 0.02% )
154,4270,1641 topdown-bad-spec # 23.2% bad speculation ( +- 0.06% )
118,3940,4591 topdown-fe-bound # 18.1% frontend bound ( +- 0.03% )
180,5324,4299 topdown-be-bound # 27.8% backend bound ( +- 0.03% )
3.144975 +- 0.000249 seconds time elapsed ( +- 0.01% )
Performance counter stats for './sort_dist listsort almost_sort' (1000 runs):
2889.82 msec task-clock # 0.999 CPUs utilized ( +- 0.01% )
7 context-switches # 2.421 /sec ( +- 1.51% )
2 cpu-migrations # 0.692 /sec ( +- 1.12% )
35,5164 page-faults # 122.833 K/sec ( +- 0.00% )
120,8325,0954 cycles # 4.179 GHz ( +- 0.01% )
189,2326,3768 instructions # 1.57 insn per cycle ( +- 0.00% )
41,4924,8982 branches # 1.435 G/sec ( +- 0.00% )
1,0522,7313 branch-misses # 2.54% of all branches ( +- 0.01% )
603,1795,1240 slots # 20.861 G/sec ( +- 0.01% )
189,2327,8820 topdown-retiring # 31.3% retiring ( +- 0.01% )
151,3862,3056 topdown-bad-spec # 25.4% bad speculation ( +- 0.05% )
120,6359,0248 topdown-fe-bound # 20.0% frontend bound ( +- 0.02% )
141,9245,9115 topdown-be-bound # 23.3% backend bound ( +- 0.04% )
2.892066 +- 0.000265 seconds time elapsed ( +- 0.01% )
整理重要項為表格 :
my sort | list sort | |
---|---|---|
task clock | 3137.17 msec | 2889.82 msec |
context switches | 8 | 7 |
page faults | 35,5163 | 35,5164 |
instructions | 206,3582,6760 | 189,2326,3768 |
branches misses | 1,0356,1613 | 1,0522,7313 |
以以下程式碼產生測試資料
void static inline reverse_sorted_data(struct list_head *head) {
for (int i = 0; i < 10000; i++) {
char s = (char) (rand() % 26 + 'a');
q_insert_head(head, &s);
}
q_sort(head, true);
}
Performance counter stats for './sort_dist qsort reverse_sort' (1000 runs):
3198.79 msec task-clock # 1.002 CPUs utilized ( +- 0.01% )
11 context-switches # 3.447 /sec ( +- 1.05% )
1 cpu-migrations # 0.313 /sec ( +- 2.77% )
35,1647 page-faults # 110.201 K/sec ( +- 0.00% )
133,7968,5032 cycles # 4.193 GHz ( +- 0.01% )
206,4549,2457 instructions # 1.55 insn per cycle ( +- 0.00% )
44,8644,2627 branches # 1.406 G/sec ( +- 0.00% )
1,0598,1816 branch-misses # 2.36% of all branches ( +- 0.01% )
666,4662,4310 slots # 20.886 G/sec ( +- 0.01% )
201,2466,6948 topdown-retiring # 30.4% retiring ( +- 0.02% )
167,2699,5905 topdown-bad-spec # 24.8% bad speculation ( +- 0.06% )
111,9575,3395 topdown-fe-bound # 16.8% frontend bound ( +- 0.05% )
188,1787,0393 topdown-be-bound # 28.0% backend bound ( +- 0.03% )
3.191458 +- 0.000349 seconds time elapsed ( +- 0.01% )
Performance counter stats for './sort_dist listsort reverse_sort' (1000 runs):
2869.04 msec task-clock # 0.997 CPUs utilized ( +- 0.01% )
13 context-switches # 4.517 /sec ( +- 0.78% )
0 cpu-migrations # 0.000 /sec
35,1649 page-faults # 122.183 K/sec ( +- 0.00% )
120,0394,6287 cycles # 4.171 GHz ( +- 0.01% )
191,8827,7875 instructions # 1.59 insn per cycle ( +- 0.00% )
40,7025,6305 branches # 1.414 G/sec ( +- 0.00% )
1,0781,1929 branch-misses # 2.65% of all branches ( +- 0.01% )
600,1660,7175 slots # 20.853 G/sec ( +- 0.01% )
190,6409,8749 topdown-retiring # 31.8% retiring ( +- 0.02% )
169,4586,5555 topdown-bad-spec # 27.8% bad speculation ( +- 0.05% )
105,9116,5972 topdown-fe-bound # 18.0% frontend bound ( +- 0.03% )
134,1547,6897 topdown-be-bound # 22.4% backend bound ( +- 0.04% )
2.878567 +- 0.000240 seconds time elapsed ( +- 0.01% )
整理重要項為表格 :
my sort | list sort | |
---|---|---|
task clock | 3198.79 msec | 2869.04 msec |
context switches | 11 | 13 |
page faults | 35,1647 | 35,1649 |
instructions | 206,4549,2457 | 191,8827,7875 |
branches misses | 1,0598,1816 | 1,0781,1929 |
新實驗後我們可以暫時整理三點:
和老師討論後發現原先 perf
並沒有顯示出 Cache 項次,在補上了 likely
及 unlikely
後,數據如下
Performance counter stats for './perf_sort qsort' (1000 runs):
1,1367,2224 L1-dcache-loads ( +- 0.08% ) (36.89%)
1,0174,7077 L1-dcache-loads-misses # 91.49% of all L1-dcache accesses ( +- 0.01% ) (37.03%)
40,2304 LLC-loads ( +- 0.83% ) (37.03%)
3121 LLC-loads-misses # 0.91% of all LL-cache accesses ( +- 3.67% ) (37.03%)
1,0315,2129 l1d.replacement ( +- 0.02% ) (36.12%)
2,2600 l1d_pend_miss.l2_stall ( +- 9.23% ) (35.98%)
113,4894 mem_load_retired.l1_hit ( +- 0.55% ) (35.98%)
1,0398,5941 mem_load_retired.l1_miss ( +- 0.01% ) (35.98%)
8,4183,5218 cycle_activity.cycles_l1d_miss ( +- 0.02% ) (35.98%)
8,4118,9368 cycle_activity.stalls_l1d_miss ( +- 0.02% ) (35.98%)
1,0465,9441 mem_inst_retired.all_loads ( +- 0.01% ) (35.98%)
0.3565312 +- 0.0000646 seconds time elapsed ( +- 0.02% )
Performance counter stats for './perf_sort listsort' (1000 runs):
1,1180,5254 L1-dcache-loads ( +- 0.06% ) (35.94%)
1,0282,2597 L1-dcache-loads-misses # 92.29% of all L1-dcache accesses ( +- 0.01% ) (37.05%)
55,1223 LLC-loads ( +- 0.79% ) (37.88%)
5174 LLC-loads-misses # 1.06% of all LL-cache accesses ( +- 3.52% ) (37.88%)
1,0396,6378 l1d.replacement ( +- 0.01% ) (37.43%)
2,7826 l1d_pend_miss.l2_stall ( +- 5.49% ) (36.33%)
102,7050 mem_load_retired.l1_hit ( +- 2.02% ) (35.50%)
1,0478,8193 mem_load_retired.l1_miss ( +- 0.01% ) (35.50%)
8,5502,9344 cycle_activity.cycles_l1d_miss ( +- 0.02% ) (35.50%)
8,5046,0227 cycle_activity.stalls_l1d_miss ( +- 0.02% ) (35.50%)
1,0576,1424 mem_inst_retired.all_loads ( +- 0.01% ) (35.50%)
0.3605272 +- 0.0000668 seconds time elapsed ( +- 0.02% )
整理重要項為表格 :
my sort | list sort | |
---|---|---|
task clock | 3198.79 msec | 2869.04 msec |
context switches | 11 | 13 |
page faults | 35,1647 | 35,1649 |
instructions | 206,4549,2457 | 191,8827,7875 |
branches misses | 1,0598,1816 | 1,0781,1929 |
在 queue.c
加入
+bool do_lsort(int argc, char *argv[]) {...}
...
+ ADD_COMMAND(lsort, "Sort the queue with list sort in Linux Kernel", "");
即可使用 lsort
呼叫 list_sort
進行排序。
我們也可以加入 option time_sort
,用以快速測量排序的速度
+ struct timespec start, end;
+ if (current && exception_setup(true)) {
+ clock_gettime(CLOCK_MONOTONIC, &start);
+ list_sort(current->q);
+ clock_gettime(CLOCK_MONOTONIC, &end);
+ if (time_sort) {
+ uint64_t delta_us = (end.tv_sec - start.tv_sec) * 1000000 + (end.tv_nsec - start.tv_nsec) / 1000;
+ report(4, "Time to sort: %ld microsecond", delta_us);
+ }
+ }
...
+ add_param("time_sort", &time_sort, "If report the time of sort", NULL);
效果如下
option fail 0
option malloc 0
option time_sort 1
new
ih RAND 100000
lsort
free
// l = []
// l = [rajxbcvef qxakpf adhydugvm ... ]
// Time to sort: 30542 microsecond
// l = [aaacprq aaahp aaalbkd ... ]
實作請見 Github commit 使用作業說明中提供的測試程式,計算卡方值以及繪圖
由於我們希望檢驗演算法輸出的資料是否符合均勻分布 (uniform distribution),故我們假設期望值
即隨機 shffle 100000次,平均分配到 24 種狀況,每種狀況平均應該出現 4166 次。
依照測試程式計算,樣本之卡方值為 30.16,依照自由度 5 查表,由於 20.52 < 30.16,得到 p-value = 0.001,故我們可以推論,在設定
能夠見到畫出的圖片亦近似於均勻分布
完整開發請見 Github branch fix_dudect
不要濫用 TL;DR,誠實面對自己,闡述你的洞見,並接受他人批評指教,從而及早發現自身認知不足。
本節做到的是
Dude, is my code constant time? 使用統計方法判斷固定輸入資料與隨機輸入資料的執行時間是否有不同,若有則表示演算法可能會因為輸入資料而有不同的效能,導致了非線性複雜度的可能。(結合 Time attack,應該有更好的表達方式,再修改)
We first measure the execution time for two different input data classes, and then check whether the two timing distributions are statistically different.
文中亦闡明了演算法的過程,即
我們可以依照上述過程找到專案中的相應程式碼,亦可參照該論文釋出的程式碼 dudect.h。
首步為計算 CPU cycles,定義於 dudect/cpucycles.h
static inline int64_t cpucycles(void)
{
#if defined(__i386__) || defined(__x86_64__)
unsigned int hi, lo;
__asm__ volatile("rdtsc\n\t" : "=a"(lo), "=d"(hi));
return ((int64_t) lo) | (((int64_t) hi) << 32);
#elif defined(__aarch64__)
// ....
#endif
}
透過指令 rdtsc
,即 Read Time-Stamp Counter,該指令會傳回自 CPU 啟動以來的循環 cycle 數,其中 lo
為低位的 32 位元,hi 為高位的 32 位元,((int64_t) lo) | (((int64_t) hi) << 32)
即可組成 CPU cycles。
這樣的計算具有幾個特徵,
特別的是,dudect.h 中的 cpucycles()
為
static inline int64_t cpucycles(void) {
_mm_mfence();
return (int64_t)__rdtsc();
}
_mm_mfence()
應是指令 MFENCE
,其作用為確保在指令 rdtsc
之前,先前的所有記憶體讀寫都應該被完成,故在計算 CPU cycles 時便不會引入先前其他指令。
MFENCE
能夠使 rdtsc
更準確計算 CPU cycles (ChatGPT)
- Serialization of Memory Operations: MFENCE ensures that all load (read) and store (write) operations that appear before the MFENCE in the program order are completed before any load or store operations that come after the MFENCE. It effectively draws a line in the sand, saying "no memory operation that comes after this point can be completed until all memory operations before this point are done."
- Impact on rdtsc: The rdtsc instruction reads the time-stamp counter, which counts the number of cycles since the processor was reset. It's a non-serializing instruction, which means it doesn't by itself ensure that previous instructions have completed. By placing an MFENCE before rdtsc, you can ensure that all prior memory operations have been fully carried out before the cycle count is read.
- Preventing Speculative Execution: Speculative execution is another form of out-of-order execution where the processor guesses the path of execution and begins executing ahead. If the guess is incorrect, the results are discarded. MFENCE prevents the results of any speculative execution that involve memory operations from being committed until all previous memory operations are confirmed to be completed.
Student's t-test 用於估計小樣本的未知標準差的常態分布之期望值,若樣本數大則可使用中央極限定理求其母體參數;該論文使用 Welch’s t-test 作為檢定方法,其基於 Student's t-test。
比較 dudect/fixture.c
可得知並未實作 prepare_percentiles()
,這可能導致不符合 Welch’s t-test 假設;由於其假設樣本來自於常態分布,而執行時間則是左偏態(執行時間短的任務佔大多數),故我們需要捨棄左側一部份樣本來達到常態分布假設。
在 dudect/fixture.c
中,可以找到 update_statistics()
,比對 dudect.h
,缺乏移除左側長尾的部分
-static void update_statistics(const int64_t *exec_times, uint8_t *classes)
+static void update_statistics(const int64_t *exec_times,
+ uint8_t *classes,
+ const int64_t *percentiles)
{
for (size_t i = 0; i < N_MEASURES; i++) {
int64_t difference = exec_times[i];
@@ -74,6 +109,14 @@ static void update_statistics(const int64_t *exec_times, uint8_t *classes)
/* do a t-test on the execution time */
t_push(t, difference, classes[i]);
+
+ // t-test on cropped execution times, for several cropping thresholds.
+ for (size_t crop_index = 0; crop_index < DUDECT_NUMBER_PERCENTILES;
+ crop_index++) {
+ if (difference < percentiles[crop_index]) {
+ t_push(t, difference, classes[i]);
+ }
+ }
}
}
在加入這部分後,線上 CI 通過,達到 100/100,見 fix_dudect。
指出論文和程式碼實作的出入之處。
Here are the key reasons why in-place operations hold significant importance:
- Limited Memory Resources
- Performance Optimization
- Real-time Constraints
- Atomicity and Synchronization
- Reducing Fragmentation
$ make
Fatal: GitHub Actions MUST be activated.
Check this article: https://docs.github.com/en/actions/managing-workflow-runs/disabling-and-enabling-a-workflow
Then, execute 'make' again.
make: *** [Makefile:34: .git/hooks/applied] Error 1
留意作業說明!
下棋的時候,我們是先在腦中知道可以走的走法、試誤,譬如炮二平五、俥五進六,發現某些棋對局勢更好 (更有獲勝的機會、想像不到對手有好的走法),我們就選擇這樣的走法;對手根據我們的走法,也同樣選擇,若他的選擇在我們的估計內,我們會發現勝率越來越高,最終獲勝。
參考上圖,各項流程的意義是 1. Simulation
基於目前的局勢猜測勝率,此處實作是雙方隨機落子,即蒙地卡羅模擬法,僅有 1.0
0.0
0.5
三種離散結果。
2. Backpropagation
依照隨機模擬的結果,更新某一個落子的勝率以及被選到的次數;前一輪被選中的落子,會在下一輪才被 Bcakpropagation 更新勝率。
3. Selection
根據 UCT(upper confidence bound applied to trees) 「真正」選擇下一步,在 Backpropagation 當中預估的勝率越大,前項越大 (exoloitation);在先前被選中的次數越少,後項越大,鼓勵選取未曾探索過的落子 (exploration),兩者互相拮抗;實作中設定為:
依照上式,我們可以推論 a. 當使用該節點在模擬中獲得更好的分數,越可能再次選擇它向下模擬 b. 當該節點的親代節點越常被使用,我們越可能再次選擇它向下模擬 c. 當該節點被選擇的次數太多,我們盡量避免使用它向下模擬
4. Expansion
如果某一個落子之後的結果還沒有被預估(葉節點),則將目前合法的落子初始化成他的子節點
( 如果這樣落子,會有多少勝率 (Simulation and Backpropagation) -> 思考這樣落子後,後續的發展 (Expansion) )
定點數就是將某一正數放大後的表示方法,縮放係數 (scalar factor) 係指將正數轉換成定點數需要的乘數;如 12.34 在縮放係數為
我們可以將十進位轉為二進位來觀察
0b1100.01010111000010100011110101110000101000111101011100001
3159.04:0b110001010111.000010100011110101110000101000111101011100001
可以觀察到,3159.04之定點數是將 12.34 向左位移 8 個位元,即
0b0.10000000010.1000101011100001010001111010111000010100011110101110
3159.04: 0b0.10000001010.1000101011100001010001111010111000010100011110101110
而 IEEE 754 中,sign 與 mantissa 相同,而雙精度下指數項之 bias 為
如果只是 0.5 呢?因為是
會發現,當僅有小數時,bias 是用於被扣除的!僅有小數時,可以儲存更小精度的位數,故 IEEE 754 的值域才能夠達
還有簡化比較,若沒有加上 bias,以上述狀況,指數項是負的,我們必須要在指數項引入二補數,直接比較會導致負數大於正數,譬如
在決定縮放係數前,我們首先觀察 score
之值域,以求至少在儲存浮點數以及其四則運算中誤差最小;去除掉第一次拜訪所得 DBL_MAX
之值,最大值為 3.552684,最小值為 0。
當縮放係數為
當縮放係數為 =
當縮放係數為 =
雖然採用越大的縮放係數能夠逼近更小的位數,使得損失降低,但並非越大越好;接下來我們展示定點數的乘除,就能發現縮放係數的大小事實上是一種取捨。
定點數乘法的問題在於容易上溢位 (overflow),假設縮放係數為
double a = 2.25;
double b = 5;
printf("%f\n", a * b);
// fixed point number
unsigned int c = a * (1 << FIXED_POINT_SCALE);
unsigned int d = b * (1 << FIXED_POINT_SCALE);
printf("%f\n", (double) (c * d) / pow(2, FIXED_POINT_SCALE * 2));
當 FIXED_POINT_SCALE = 14
時,兩者相等;當 FIXED_POINT_SCALE = 15
時,定點數乘法計算為 3.25
,是因為左移過程中有兩個位元溢出,故與原先值差為 8
;FIXED_POINT_SCALE = 16
時,計算結果為 0.25
,亦是溢出兩個位元。
故設計不符需求的縮放係數可能會導致上溢位;而我們可以檢測當兩乘數之最高位元相加,當大於系統位元數,就必須處理上溢位。
定點數除法則會導致下溢位 (underflow),除了上溢位的反例,即過小的縮放係數會導致小數位的損失外,當兩數的最高位數相近,或除數大於被除數,也會導致下溢位;我們可以先放大被除數來避免,如
static unsigned int fixed_point_divide(unsigned int a, unsigned int b)
{
unsigned int result;
int a_ls = __fls(a);
int b_ls = __fls(b);
if ((b > a || b_ls - a_ls >= FIXED_POINT_SCALE) && a_ls + FIXED_POINT_SCALE < 32)
result = (a << FIXED_POINT_SCALE) / b;
else
result = (a / b) << FIXED_POINT_SCALE;
return result;
}
int main()
{
double a = 123;
double b = 45;
printf("%f\n", a / b); // 2.733333
unsigned int c = a * (1 << FIXED_POINT_SCALE);
unsigned int d = b * (1 << FIXED_POINT_SCALE);
double r1 = c / d;
double r2 = fixed_point_divide(c, d);
printf("%f\n", (double) r1); // 2.000000
printf("%f\n", (double) r2 / (1 << FIXED_POINT_SCALE)); // 2.730469
}
透過 __fls()
得到最高位數進行檢查避免上溢位,以上程式碼可以處理 #define ITERATIONS 100000
的情況下,log(n_total) / n_visits
約在 1.9 附近,遠小於 __fls()
檢查除法中的上溢位,直接將 (a << FIXED_POINT_SCALE) / b
作為除法結果。
牛頓勘根法的一般式可寫作
故我們將
#define FIXED_POINT_SCALE 8
#define TOLERANCE (1 << (FIXED_POINT_SCALE / 2))
static inline unsigned int fixed_point_divide(unsigned int a, unsigned int b)
{
return (a << FIXED_POINT_SCALE) / b;
}
static inline unsigned int fixed_point_sqrt(unsigned int x)
{
unsigned int r = x > 1 ? (x >> 1) : 1;
unsigned int r_new;
while (1) {
if (r == 0)
return r;
unsigned int div = fixed_point_divide(x, r);
r_new = (r + div) >> 1;
unsigned int diff = r_new - r;
if (diff < TOLERANCE) {
return r_new;
}
r = r_new;
}
}
int main()
{
double max_diff = 0;
double sum_diff = 0;
double std_diff = 0;
double *diff_arr = malloc(65535 * sizeof(double));
for (int a = 1; a < 65536; a++) {
float sqrt_2 = sqrt(a);
unsigned int approx_sqrt_2 = fixed_point_sqrt(a * (1 << FIXED_POINT_SCALE));
double approx_sqrt_2_double = fixed_point_to_double(approx_sqrt_2, FIXED_POINT_SCALE);
double diff = fabs(sqrt_2 - approx_sqrt_2_double);
diff_arr[a - 1] = diff;
sum_diff += diff;
if (diff > max_diff) {
max_diff = diff;
}
}
double mean_diff = sum_diff / 65535;
for (int i = 0; i < 65535; i++) {
std_diff += pow(diff_arr[i] - mean_diff, 2);
}
std_diff = sqrt(std_diff / 65535);
printf("%f, %f, %f\n", max_diff, mean_diff, std_diff);
}
經以上實驗,牛頓勘根法的近似值與浮點數 sqrt
之最大差值為 0.003906,其平均值為 0.001957,標準差為 0.001132。
礙於顯示效果,我們每 100 個數才作圖:
由於差異極小,很難辨別誤差,但確實展示出極佳的逼近效果。
依據上述方法,令
static inline unsigned int fixed_point_log(unsigned int x)
{
if (x == 1)
return 0;
unsigned int l = x > 1 ? (x >> 1) : 1;
unsigned int l_new;
while (1) {
if (l == 0)
return l;
unsigned int fx0 = pow(2, (double) l / (1 << FIXED_POINT_SCALE)) * (1 << FIXED_POINT_SCALE);
unsigned int div = fixed_point_divide(x - fx0, fixed_point_multiply(fx0, LN2));
l_new = l + div;
unsigned int diff = l_new - l;
if (diff < 1) {
return l_new;
}
l = l_new;
}
}
在實作時發現重大問題是,當 pow()
;參考 marvin0102,應引入尤拉公式。
主要將 uct_score
改為定點數運算:
static inline unsigned int uct_score(unsigned int n_total, unsigned int n_visits, unsigned int score)
{
if (n_visits == 0)
- return DBL_MAX;
+ return UINT_MAX;
- return score / n_visits +
- EXPLORATION_FACTOR * sqrt(log(n_total) / n_visits);
+ unsigned int exploitation = fixed_point_divide(score, n_visits);
+ unsigned int exploration = fixed_point_sqrt(fixed_point_divide(fixed_point_log(n_total), n_visits));
+ return exploitation + EXPLORATION_FACTOR * exploration;
}
與原先 uct_score
之值相減後計算出誤差,作圖如下
如 lab0,新增 ai_vs_ai
參數,並使用 do_ttt
包裝 ttt
函數,將控制參數傳入即可,見 commit 7d11da4。
惟為了使用 function pointer 使程式更簡潔,將 negamax_predict
傳回類型改為整數類,因目前實作中暫不需要 move_t
;且三者皆改為傳入 (char *, char *)
,隨後再轉型回各自之型態。
在實作 coroutine 前,係直接取代 get_input
,而直接導向原先電腦演算法,此時已經可以是 "電腦 v.s. 電腦" 對弈。
- move = get_input(turn);
- if (table[move] == ' ') {
+ if (ai_vs_ai)
+#ifdef USE_RL
+ move = ai_algorithm(table, &agent);
+#else
+ move = ai_algorithm(table, &turn);
+#endif
+ else
+ move = get_input(turn);
+ if (table[move] == ' ')
break;
- }
考慮 concurrent-programs/coro/coro.c