linux2023: Cuda-Chen

# linux2023 hw1: Cuda-Chen > [題目](https://hackmd.io/@sysprog/linux2023-summer-quiz0) > [GitHub](https://github.com/Cuda-Chen/linux-2023-summer/tree/main/hw1) ## 測驗 α ### 測驗 α − 1 AAAA 到 EEEE 部份，其位於 `st_remove()` 函式。該函式如下： ```c void st_remove(struct st_node **root, struct st_node *del) { if (st_right(del)) { struct st_node *least = st_first(st_right(del)); if (del == *root) *root = least; AAAA; BBBB; return; } if (st_left(del)) { struct st_node *most = st_last(st_left(del)); if (del == *root) *root = most; CCCC; DDDD; return; } if (del == *root) { *root = 0; return; } /* empty node */ struct st_node *parent = st_parent(del); if (st_left(parent) == del) st_left(parent) = 0; else st_right(parent) = 0; EEEE; } ``` 根據函式註解，在 remove 一個 node 後，為了維護 S-tree 的 binary search tree 之性質，應該將即將被 remove 之 node 和其 left/right child 進行 replace（replace the node being remove with its left/right child）。最後，記得更新其 parent node。所以： - `AAAA: st_replace_right(del, least)` - `BBBB: st_update(root, least)` - `CCCC: st_replace_left(del, most)` - `DDDD: st_update(root, most)` - `EEEE: st_update(root, parent)` FFFF 以及 GGGG 部份，其位於 `__treeint_dump()` 函式。該函式如下： ```c static void __treeint_dump(struct st_node *n, int depth) { if (!n) return; __treeint_dump(FFFF, depth + 1); struct treeint *v = treeint_entry(n); printf("%d\n", v->value); __treeint_dump(GGGG, depth + 1); } ``` 可以看出該函式要印出一個升冪排序，對於一個 binary search tree，印出升冪排序應該使用 inorder traversal 進行 traverse。所以： - `FFFF: st_left(n)` - `GGGG: st_right(n)` ### 測驗 α − 2 待補目前在 [ecf4580](https://github.com/Cuda-Chen/linux-2023-summer/commit/ecf4580b72499ffb006e6d7679bd90f510e48b21) 新增 rbtree 的測試，不過目前在 `#include <linux/rbtree.h>` 會遇到 `header not found error`（即便在 complie 給 `-I/usr/include` 也會遇到 error） test cases: - random insert/find/remove - sequential insert/find/remove ### 測驗 α − 3 待補可以根據以下的方式建立一個 skewed binary search tree： - 給一個陣列：`[1, 2, 3, 4, 5, 6]` - 以 idx 順序輸入 - idx +2 -> -1 -> +2 -> -1 -> ... 的順去進行插入 ## 測驗 β - `HHHH: ((sz + mask) & ~mask)` ### 測驗 β − 1 - 給予一個變數 sz 之記憶體對齊位址。 - 當 alignment 為 power of two 時，使用 bit operation 來加速記憶體對齊位址之計算。 ### 測驗 β − 2 - https://elixir.bootlin.com/linux/latest/A/ident/align_up - 於 `v6.4.8` 的 kvm unit test 中可以看到使用 `align_up` 以計算記憶體對齊位址： ```c // https://elixir.bootlin.com/linux/v6.4.8/source/tools/testing/selftests/kvm/include/test_util.h#L158 static inline uint64_t align_down(uint64_t x, uint64_t size) { uint64_t x_aligned_up = align_up(x, size); if (x == x_aligned_up) return x; else return x_aligned_up - size; } ``` ## 測驗 γ <s> > 題目：https://gist.github.com/jserv/38bca4ebfea1c434e1dfc15337f80eeb - `HHHH: pthread_cond_wait(&qs->cond_st, &qs->mtx_st)` - `JJJJ: pthread_cond_signal(&qs2->cond_st)` </s> ### 測驗 γ − 1 觀察 `HHHH` 以及 `JJJJ` 所位於之程式碼區段： ```c again: /* 用來確認 thread 是否可以開始執行 */ verify(pthread_mutex_lock(&qs->mtx_st)); while (qs->st == ts_idle) verify(HHHH); verify(pthread_mutex_unlock(&qs->mtx_st)); if (qs->st == ts_term) { return NULL; } assert(qs->st == ts_work); /* 執行 quick sort */ qsort_algo(qs); /* 用來從 thread pool 中挑選 thread 來執行 qsort */ verify(pthread_mutex_lock(&c->mtx_al)); qs->st = ts_idle; c->idlethreads++; if (c->idlethreads == c->nthreads) { for (i = 0; i < c->nthreads; i++) { qs2 = &c->pool[i]; if (qs2 == qs) continue; verify(pthread_mutex_lock(&qs2->mtx_st)); qs2->st = ts_term; verify(JJJJ); verify(pthread_mutex_unlock(&qs2->mtx_st)); } verify(pthread_mutex_unlock(&c->mtx_al)); return NULL; } verify(pthread_mutex_unlock(&c->mtx_al)); goto again; ``` ### 測驗 γ − 2 > 使用如下命令以啟用 Thread Sanitizer： > `gcc -Wall -o qsort-mt gamma_solution.c -lpthread -fsanitize=thread -fPIE -pie -g` 使用 Thread Sanitizer 後可以看到當 n 足夠大時會出現 data race： ```bash $ ./qsort_mt -n 1000 ================== WARNING: ThreadSanitizer: data race (pid=12354) Write of size 4 at 0x7b4000000080 by thread T1 (mutexes: write M0): #0 allocate_thread /home/jio/c_code/linux-2023-summer/hw1/gamma_solution.c:214 (qsort_mt+0x3476) #1 qsort_algo /home/jio/c_code/linux-2023-summer/hw1/gamma_solution.c:317 (qsort_mt+0x4193) #2 qsort_thread /home/jio/c_code/linux-2023-summer/hw1/gamma_solution.c:354 (qsort_mt+0x45b1) Previous read of size 4 at 0x7b4000000080 by thread T2 (mutexes: write M2): #0 qsort_thread /home/jio/c_code/linux-2023-summer/hw1/gamma_solution.c:346 (qsort_mt+0x44c7) Location is heap block of size 256 at 0x7b4000000000 allocated by main thread: #0 calloc ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:672 (libtsan.so.0+0x31edc) #1 qsort_mt /home/jio/c_code/linux-2023-summer/hw1/gamma_solution.c:135 (qsort_mt+0x28fa) #2 main /home/jio/c_code/linux-2023-summer/hw1/gamma_solution.c:506 (qsort_mt+0x500e) Mutex M0 (0x7fff4a605078) created at: #0 pthread_mutex_init ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:1227 (libtsan.so.0+0x4bee1) #1 qsort_mt /home/jio/c_code/linux-2023-summer/hw1/gamma_solution.c:133 (qsort_mt+0x28dd) #2 main /home/jio/c_code/linux-2023-summer/hw1/gamma_solution.c:506 (qsort_mt+0x500e) Mutex M2 (0x7b40000000a8) created at: #0 pthread_mutex_init ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:1227 (libtsan.so.0+0x4bee1) #1 qsort_mt /home/jio/c_code/linux-2023-summer/hw1/gamma_solution.c:139 (qsort_mt+0x297f) #2 main /home/jio/c_code/linux-2023-summer/hw1/gamma_solution.c:506 (qsort_mt+0x500e) Thread T1 (tid=12356, running) created by main thread at: #0 pthread_create ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:969 (libtsan.so.0+0x605b8) #1 qsort_mt /home/jio/c_code/linux-2023-summer/hw1/gamma_solution.c:147 (qsort_mt+0x2a8e) #2 main /home/jio/c_code/linux-2023-summer/hw1/gamma_solution.c:506 (qsort_mt+0x500e) Thread T2 (tid=12357, running) created by main thread at: #0 pthread_create ../../../../src/libsanitizer/tsan/tsan_interceptors_posix.cpp:969 (libtsan.so.0+0x605b8) #1 qsort_mt /home/jio/c_code/linux-2023-summer/hw1/gamma_solution.c:147 (qsort_mt+0x2a8e) #2 main /home/jio/c_code/linux-2023-summer/hw1/gamma_solution.c:506 (qsort_mt+0x500e) SUMMARY: ThreadSanitizer: data race /home/jio/c_code/linux-2023-summer/hw1/gamma_solution.c:214 in allocate_thread ================== ThreadSanitizer: reported 1 warnings ``` 可以看到如下的程式碼（註解所示）發生 data race： ```c static struct qsort *allocate_thread(struct common *c) { verify(pthread_mutex_lock(&c->mtx_al)); for (int i = 0; i < c->nthreads; i++) if (c->pool[i].st == ts_idle) { c->idlethreads--; c->pool[i].st = ts_work; // <-- 這裡發生 data race verify(pthread_mutex_lock(&c->pool[i].mtx_st)); verify(pthread_mutex_unlock(&c->mtx_al)); return (&c->pool[i]); } verify(pthread_mutex_unlock(&c->mtx_al)); return (NULL); } ``` 在變更 thread 的 status 時，我們應該在 critical section 進行變更。因此程式碼應該修改如下： ```c static struct qsort *allocate_thread(struct common *c) { verify(pthread_mutex_lock(&c->mtx_al)); for (int i = 0; i < c->nthreads; i++) if (c->pool[i].st == ts_idle) { c->idlethreads--; //c->pool[i].st = ts_work; // <-- 這裡發生 data race verify(pthread_mutex_lock(&c->pool[i].mtx_st)); c->pool[i].st = ts_work; // <-- 移動到這裡 verify(pthread_mutex_unlock(&c->mtx_al)); return (&c->pool[i]); } verify(pthread_mutex_unlock(&c->mtx_al)); return (NULL); } ``` ### 測驗 γ − 3 [15471da](https://github.com/Cuda-Chen/linux-2023-summer/commit/15471da8452e1512d565adb654cc47da00713d6e) 先使用 `perf stat` 來看我們哪裡可以進行 optimization： ``` $ perf stat --repeat 5 ./qsort-mt Performance counter stats for './qsort-mt' (5 runs): 3,846.71 msec task-clock # 1.811 CPUs utilized ( +- 0.08% ) 67 context-switches # 17.403 /sec ( +- 2.24% ) 6 cpu-migrations # 1.558 /sec ( +- 15.81% ) 9,830 page-faults # 2.553 K/sec ( +- 0.01% ) 11,549,247,140 cycles # 3.000 GHz ( +- 0.06% ) 4,943,291,732 stalled-cycles-frontend # 42.79% frontend cycles idle ( +- 0.24% ) 13,990,699,557 instructions # 1.21 insn per cycle # 0.35 stalled cycles per insn ( +- 0.00% ) 1,909,353,615 branches # 495.939 M/sec ( +- 0.00% ) 108,735,936 branch-misses # 5.69% of all branches ( +- 0.03% ) 2.12391 +- 0.00267 seconds time elapsed ( +- 0.13% ) ``` 再來使用 `perf record -e stalled-cycles-frontend:u,instructions:u,cycles:u,branches:u` 來看 hotspots： ``` # stalled-cycles-frontend:u 52.31% qsort-mt qsort-mt [.] qsort_algo ▒ 26.38% qsort-mt qsort-mt [.] swapfunc ▒ 12.38% qsort-mt qsort-mt [.] num_compare ◆ 4.04% qsort-mt libc.so.6 [.] __random ▒ 1.82% qsort-mt qsort-mt [.] main ▒ 0.90% qsort-mt qsort-mt [.] med3 ``` 看到 `qsort_algo` 是個佔比最多的 hotspot，所以使用 annotate 來看看哪個 code 最常被用到： ```c │ while (pb <= pc && (r = CMP(thunk, pb, a)) <= 0) { ▒ 2.26 │478: mov -0xb0(%rbp),%rax ▒ 3.82 │ cmp -0xa8(%rbp),%rax ▒ │ ↓ ja 54b ▒ 3.31 │ mov -0x80(%rbp),%rdx ▒ 0.02 │ mov -0xb0(%rbp),%rax ▒ 0.05 │ mov -0x60(%rbp),%rcx ▒ 0.75 │ mov %rdx,%rsi ▒ 0.57 │ mov %rax,%rdi ▒ │ → call *%rcx ▒ 0.01 │ mov %eax,-0xcc(%rbp) ▒ 1.65 │ cmpl $0x0,-0xcc(%rbp) ▒ 10.77 │ ↑ jle 3ee ▒ │ } ``` 可以看到在 swap 部份是佔有比例最多的。 [b1ad756](https://github.com/Cuda-Chen/linux-2023-summer/commit/b1ad75610b1cb39c6040f85260db012795e907dd) 不過因為亂數生成是使用 `rand()`，考慮原本未排序資料過於固定，會影響實驗結果。參考[2021年作業-ksort](https://hackmd.io/@sysprog/linux2021-sort) 後，選擇 [Xoroshiro128+](https://en.wikipedia.org/wiki/Xoroshiro128%2B) 讓生成的亂數更亂，之後再跑`perf stat`： ``` $ perf stat --repeat 5 ./qsort-mt Performance counter stats for './qsort-mt' (5 runs): 3,914.86 msec task-clock # 1.857 CPUs utilized ( +- 0.11% ) 74 context-switches # 18.899 /sec ( +- 25.97% ) 19 cpu-migrations # 4.852 /sec ( +- 12.23% ) 9,836 page-faults # 2.512 K/sec ( +- 0.01% ) 11,737,718,399 cycles # 2.998 GHz ( +- 0.12% ) 5,255,276,215 stalled-cycles-frontend # 44.81% frontend cycles idle ( +- 0.32% ) 14,426,760,095 instructions # 1.23 insn per cycle # 0.36 stalled cycles per insn ( +- 0.00% ) 1,875,724,083 branches # 479.038 M/sec ( +- 0.00% ) 109,678,859 branch-misses # 5.85% of all branches ( +- 0.11% ) 2.10864 +- 0.00409 seconds time elapsed ( +- 0.19% ) ``` 我們也來測試在不同大小的 input 的執行時間： | N | time (measure in seconds) | | -------- | -------- | | 10^2 | ~0 | | 10^3 | 0.00252 | | 10^4 | ~0 | | 10^6 | 0.35 | | 10^7 | 3.86 | | 10^9 | 506 | 不免俗的，使用 `perf record -e stalled-cycles-frontend:u,instructions:u,cycles:u,branches:u` 來看 hotspots： ``` # stalled-cycles-frontend:u 54.81% qsort-mt qsort-mt [.] qsort_algo 26.06% qsort-mt qsort-mt [.] swapfunc 13.71% qsort-mt qsort-mt [.] num_compare 1.27% qsort-mt qsort-mt [.] main 0.98% qsort-mt qsort-mt [.] next 0.81% qsort-mt qsort-mt [.] med3 0.64% qsort-mt qsort-mt [.] rotl ``` 使用 annotate 來看看哪個是不是 swap 部份仍然最常被用到： ```c │ while (pb <= pc && (r = CMP(thunk, pb, a)) <= 0) { ▒ 1.90 │478: mov -0xb0(%rbp),%rax ▒ 2.75 │ cmp -0xa8(%rbp),%rax ▒ │ ↓ ja 54b ▒ 3.65 │ mov -0x80(%rbp),%rdx ▒ 0.03 │ mov -0xb0(%rbp),%rax ▒ 0.06 │ mov -0x60(%rbp),%rcx ▒ 0.49 │ mov %rdx,%rsi ▒ 0.68 │ mov %rax,%rdi ▒ 0.01 │ → call *%rcx ▒ 0.04 │ mov %eax,-0xcc(%rbp) ◆ 1.41 │ cmpl $0x0,-0xcc(%rbp) ▒ 9.80 │ ↑ jle 3ee ``` 目前看下來在 swap 其中的 `while (pb <= pc && (r = CMP(thunk, pb, a)) <= 0)` 最常被用到。我們可以從減少 swap 次數的思路出發進行最佳化。（目前發現使用任何 sorting 配合使用 Xoroshiro128+ 會導致結果不正確（例如：-87 > 1000）） ~~在 [專題: lib/sort.c](https://hackmd.io/@sysprog/B146OVeHn) 中，我考慮使用 hybird sort 的 intro sort 進行效能改進的實做。~~ 我認為應該從 quick sort partition method 先進行 optimize（題目給的程式碼是使用 quick sort w/ Hoare partition），在以下兩篇文章提到可以使用 branchless Lomutor partition 來嘗試提昇效率： - [Hoare’s Rebuttal and Bubble Sort’s Comeback](https://blog.reverberate.org/2020/05/29/hoares-rebuttal-bubble-sorts-comeback.html) - 這篇提到的 optimized bubble sort 也可以嘗試 - [Lomuto’s Comeback](https://dlang.org/blog/2020/05/14/lomutos-comeback/) 或者也可以改成 lock-free thread pool。 [fcb4130](https://github.com/Cuda-Chen/linux-2023-summer/commit/fcb4130e850466feeedd3d2cb7ae0693a48c47e9) [e591bce](https://github.com/Cuda-Chen/linux-2023-summer/commit/e591bcee15efad63467406f4284a19232f3ecad9) 目前加入 optimized bubble sort，不過結果不正確，正在 debug。 [6ed1471](https://github.com/Cuda-Chen/linux-2023-summer/commit/6ed1471ab2489b3bef22acc40dc47eb90f8f26c6) 加入 slightly-optimized bubble sort 之後，在某些 input size 的執行時間有減少一些。 | N | time (measure in seconds) | | -------- | -------- | | 10^2 | 0.00209 | | 10^3 | 0.0018 | | 10^4 | 0 | | 10^6 | 0.353 | | 10^7 | 4.03 | | 10^9 | 495 | [2b61777](https://github.com/Cuda-Chen/linux-2023-summer/commit/2b61777b9037ec2cebafabdbff70f56cf5e1cd16) 使用 optimized bubble sort，不過執行時間在所有情境並沒有減少。 | N | time (measure in seconds) | | -------- | -------- | | 10^2 | ~0 | | 10^3 | 0.00213 | | 10^4 | 0.00509 | | 10^6 | 0.356 | | 10^7 | 4.07 | | 10^9 | 517 | ## MISC ### alpha - https://www.facebook.com/groups/system.software2023/permalink/709222697696554/ - https://hackmd.io/@sysprog/linux-rbtree - https://hackmd.io/Yai5h5nOTcyfeCU8GuDmng - https://github.com/Zimilo/kernel-dev-examples/blob/master/rbtree_example/example.c ### gamma - https://opensource.apple.com/source/xnu/xnu-1456.1.26/bsd/kern/qsort.c - https://juejin.cn/s/c%E8%AF%AD%E8%A8%80qsort%E5%87%BD%E6%95%B0%E6%BA%90%E7%A0%81 - generic bubble sort - https://github.com/Bidzina/Generic-Bubble-Sort-in-C/blob/master/main.c