linux2023 hw2: Cuda-Chen

# 2023 Homework2 (quiz1) contributed by < `Cuda-Chen` > [2023 summer quiz1 題目](https://hackmd.io/@sysprog/linux2023-summer-quiz1) [source code hosted on GitHub](https://github.com/Cuda-Chen/linux-2023-summer/tree/main/hw2) ### futex 介紹在 concurrency programming 中，我們可以使用 mutux + cond var 來防止 race condition。然此種作法會需要兩個 wait 加上兩個 wake ，並且以上四次動作需要 syscall，造成效能浪費。為了增進效能，Linux 提供 futex（fast userspace mutex），讓相關的操作可以在 userspace 進行，減少 syscall 的次數；亦提供 requeue 功能，將二個佇列之間的執行緒縮減為一次呼叫，減少競爭 mutex lock 的次數，進而提昇效能。 ## 測驗 `1` ### 測驗 `1` - 1 #### 測試方法（待補） #### 運作原理這個程式是計算於 coputation graph 中每個 node 之執行時間。需要修改之地方分佈在 `cond.h` 以及 `mutex.h` 中，所以我們接下來看這兩個檔案。看看 `AAAA` ~ `EEEE` 所在程式碼部份： ```c // cond.h /* 用以等待 cond var 是否已到特定條件（`cond->seq`） */ static inline void cond_wait(cond_t *cond, mutex_t *mutex) { int seq = load(&cond->seq, relaxed); mutex_unlock(mutex); #define COND_SPINS 128 for (int i = 0; i < COND_SPINS; ++i) { if (load(&cond->seq, relaxed) != seq) { mutex_lock(mutex); return; } spin_hint(); } futex_wait(&cond->seq, seq); mutex_lock(mutex); fetch_or(&mutex->state, MUTEX_SLEEPING, relaxed); // AAAA } /* 發出 signal 以通知其他 thread */ static inline void cond_signal(cond_t *cond, mutex_t *mutex) { fetch_add(&cond->seq, 1, relaxed); // BBBB futex_wake(&cond->seq, 1); // EEEE } /* 發出 broadcast 給其他 thread */ static inline void cond_broadcast(cond_t *cond, mutex_t *mutex) { fetch_add(&cond->seq, 1 relaxed); // CCCC futex_requeue(&cond->seq, 1, &mutex->state); // DDDD } ``` ```c // mutex.h /* 將 mutex unlock，並且如果 mutex 是 MUTEX_SLEEPING，發出 wake signal */ static inline void mutex_unlock(mutex_t *mutex) { int state = exchange(&mutex->state, 0, release); if (state & MUTEX_SLEEPING) futex_wake(&mutex->state, 1); // FFFF } ``` #### futex 使用方式根據 [futex(2)](https://man7.org/linux/man-pages/man2/futex.2.html) 提到的用法如下： ```c #include <linux/futex.h> /* Definition of FUTEX_* constants */ #include <sys/syscall.h> /* Definition of SYS_* constants */ #include <unistd.h> long syscall( SYS_futex, uint32_t *uaddr, int futex_op, uint32_t val, const struct timespec *timeout, /* or: uint32_t val2 */ uint32_t *uaddr2, uint32_t val3 ); ``` 由於 `futex()` 沒有提供 wrapper，所以使用的時候需要使用 `syscall()`。我們可以參考作業中的以下兩個函式來達成 mutex 想要的 wait 以及 wake： ```c /* Atomically check if '*futex == value', and if so, go to sleep */ static inline void futex_wait(atomic int *futex, int value) { syscall(SYS_futex, futex, FUTEX_WAIT_PRIVATE, value, NULL); } /* Wake up 'limit' threads currently waiting on 'futex' */ static inline void futex_wake(atomic int *futex, int limit) { syscall(SYS_futex, futex, FUTEX_WAKE_PRIVATE, limit); } ``` #### Linux 核心開發者針對 POSIX Threads 強化哪些相關 futex 實作機制 futex 目前在有以下問題（摘自 [The futex_waitv() syscall and gaming on Linux](https://www.collabora.com/news-and-blog/blog/2023/02/17/the-futex-waitv-syscall-gaming-on-linux/)）： > 1. The futex_wait() can only be done one one futex at a time. > 2. The futex syscall interface is complex because of the multiplexing of plenty operations. > 3. Only the futex word of size 32-bit is supported. > 4. The implementation of the futex isn't NUMA aware. 為此，[[PATCH v5 00/11] Add futex2 syscalls](https://lore.kernel.org/lkml/20210709001328.329716-1-andrealmeid@collabora.com/) 中提到 Collabora 為了解決以上問題，提出兩個作法： 1. 新增 `futex_waitv()` syscall 2. 讓 futex 大小可以任意調節 ##### 新增 `futex_waitv()` syscall > The futex_wait() can only be done one one futex at a time. 為了解決這個問題，我們可以使用 `eventfd()` syscall 來等待多個 mutex，然而當 mutex 數量到一定程度時，`eventfd()` 會變成 bottleneck。因此，Collabora 引進 `futex_waitv()` syscall。`futex_waitv()` 可以對一個 futex array 進行等待。 ##### 讓 futex 大小可以任意調節 futex 的大小固定是 32-bit，然而此固定大小除了不能容納更多的 debug messages 外，也會使 futex 於 NUMA 的表現產生影響。所以 Collabora 讓 futex 的大小可以根據工程師的需求進行大小調整，其應用情境如下： 1. futex 大小調整為 64-bit，用以容納更多 debug messages 2. futex 大小調整為 8-bit，在 NUMA 時可以減少資料傳輸，提昇效能 ### 測驗 `1` - 2 [b2fd20a](https://github.com/Cuda-Chen/linux-2023-summer/commit/b2fd20a4c1766566729f05b74cb1e7492758beff) 主要將原本的 `pthread_mutex_xxx` 以及 `pthread_cond_xxx` 分別改成 `mutex_xxx` 以及 `cond_xxx`（有例外是 `pthread_mutex_destroy` 以及 `pthread_cond_destroy`，遇到這兩個直接刪掉即可）執行時間如下表（可以看出跟 [作業一](https://hackmd.io/@gb_c16rKTiu-nweuRXXVNg/linux2023-summer-quiz?view#%E6%B8%AC%E9%A9%97-%CE%B3-%E2%88%92-3) 相比，當 N（input size）越來越大，執行時間有縮短）： | N | time (measure in seconds) | | -------- | -------- | | 10^2 | 0.00374 | | 10^3 | 0.00629 | | 10^4 | 0.00891 | | 10^6 | 0.218 | | 10^7 | 2 | | 10^9 | 279 | ### 測驗 `1` - 3 可能可以從 futex2 開始。執行時要用 `taskset` 限制在一個 CPU 上執行，且指定 priority 要記得使用 `sudo` [a9b00e7](https://github.com/Cuda-Chen/linux-2023-summer/commit/a9b00e77448749221a3fcf70c19b598f5529a5a8) 首先我們可以從 [實作 priority inversion](https://hackmd.io/@sysprog/concurrency-thread-package#%E5%AF%A6%E4%BD%9C-priority-inversion) 中建立三個擁有 priority 的 process 以進行實驗。分配 priority 以及 task 內容如下： ```c void *task1(void *arg) { mutex_lock(&mtx1); printf("1\n"); mutex_unlock(&mtx1); return NULL; } void *task2(void *arg) { mutex_lock(&mtx2); sleep(1); printf("2\n"); mutex_unlock(&mtx2); return NULL; } void *task3(void *arg) { mutex_lock(&mtx1); sleep(1); printf("3\n"); mutex_unlock(&mtx1); return NULL; } param.sched_priority = (THREADCNT - 2) * 10; pthread_attr_setschedparam(&attr, &param); pthread_create(&th[2], NULL, task3, (void *)NULL); param.sched_priority = (THREADCNT - 1) * 10; pthread_attr_setschedparam(&attr, &param); pthread_create(&th[1], NULL, task2, (void *)NULL); param.sched_priority = (THREADCNT) * 10; pthread_attr_setschedparam(&attr, &param); pthread_create(&th[0], NULL, task1, (void *)NULL); ``` 其中，thread 0 有最高的 priority。然而，實驗結果卻顯示 thread 2 先行完成： ``` $ ./pi-test 3 1 2 ``` #### 實做 priority inheritance mutex 參考 [futex(2)](https://man7.org/linux/man-pages/man2/futex.2.html) 中的 priority inheritance mutex 段落進行實做。 [46099fd](https://github.com/Cuda-Chen/linux-2023-summer/commit/46099fd621e59988943b5b0c2d16688b53913baa) 在這個 commit 實做 PI-mutex，不過仍然遇到 priority inversion： ``` $ ./pi-test 3 1 2 ``` ### 測驗 `1` - 4 （待補）首先先把執行 `make` 並且執行三個編譯出來的執行檔： ``` $ ./perf-pthreads Measuring Locking and unlocking without contention: best 8ns, 50%ile 8ns Measuring Locking and unlocking with contention: best 3449ns, 50%ile 5225ns $ ./perf-skinny Measuring Locking and unlocking without contention: best 13ns, 50%ile 13ns Measuring Locking and unlocking with contention: best 4659ns, 50%ile 6178ns $ ./perf-spinlock Measuring Locking and unlocking without contention: best 9ns, 50%ile 9ns Measuring Locking and unlocking with contention: best 97ns, 50%ile 141ns ``` ### code may improve ```c static inline void spin_lock(spinlock_t *lock) { /* Given the specific implementation of spin_trylock(), this results in * a test-and-test-and-set (TTAS) loop. Refer to * https://rigtorp.se/spinlock/ for more information. */ while (!spin_trylock(lock)) spin_hint(); } ``` ## MISC ### WINE - https://www.facebook.com/groups/598186762133482/search/?q=wine - https://hackmd.io/@taipeiTech/H1LzOsFSB/https%3A%2F%2Fhackmd.io%2F%40sysprog%2Flinux-comic#Wine-%E8%A1%8C%E7%A8%8B ### futex - https://eli.thegreenplace.net/2018/basics-of-futexes/ - https://github.com/weirddan455/futex-test - https://lore.kernel.org/lkml/20210709001328.329716-1-andrealmeid@collabora.com/ ### futex for MS gamming - https://www.collabora.com/news-and-blog/blog/2023/02/17/the-futex-waitv-syscall-gaming-on-linux/ - https://www.phoronix.com/news/FUTEX2-v5 ### futex2 - https://docs.kernel.org/userspace-api/futex2.html ### PI-futex - https://hackmd.io/6FOQGjztRx29BnDt2OnzSw?view - https://docs.kernel.org/locking/rt-mutex-design.html - lockdep.c (172K) ### linux workqueue.c - `try_to_get_pending` ### N/A - https://www.hillelwayne.com/post/safety-and-liveness/