# 2023 Homework2 (quiz1)
contributed by < `rk42745417` >
Quiz answers:
1. `AAAA` = `MUTEX_SLEEPING`
2. `BBBB` = `1`
3. `CCCC` = `1`
4. `DDDD` = `1`
5. `EEEE` = `futex_wake`
6. `FFFF` = `futex_wake`
## Q1
### Futex Overview
Both the mutex and the condition variable are implemented on top of futexes. According to [futex(2)](https://man7.org/linux/man-pages/man2/futex.2.html), a futex is simply a 32-bit value; the following operations are used here (wrapped as sketched after this list):
+ FUTEX_WAIT: if the value pointed to by `uaddr` still equals `val`, the calling thread sleeps on that futex.
+ FUTEX_WAKE: wakes at most `val` of the threads sleeping on the futex at `uaddr`.
    + Here we always wake exactly one thread (`val` is $1$).
+ FUTEX_REQUEUE: wakes at most `val` of the threads sleeping on the futex at `uaddr`, then moves up to `val2` of the remaining threads onto the wait queue of the futex at `uaddr2`.
    + As with FUTEX_WAKE, we always wake at most one thread, and move all remaining threads (`val2` is `INT_MAX`).
    + Used only by the condition variable.
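For reference, these operations can be wrapped over `syscall(2)` roughly as follows. This is a minimal self-contained sketch in plain C11; the quiz's `futex.h` uses its own `atomic` macro and may differ in details:
```c
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Sleep as long as *futex still holds 'value'. */
static inline void futex_wait(atomic_int *futex, int value)
{
    syscall(SYS_futex, futex, FUTEX_WAIT_PRIVATE, value, NULL);
}

/* Wake at most 'limit' threads sleeping on 'futex'. */
static inline void futex_wake(atomic_int *futex, int limit)
{
    syscall(SYS_futex, futex, FUTEX_WAKE_PRIVATE, limit);
}

/* Wake one thread, then move up to 'limit' remaining waiters onto 'other'. */
static inline void futex_requeue(atomic_int *futex, int limit, atomic_int *other)
{
    syscall(SYS_futex, futex, FUTEX_REQUEUE_PRIVATE, 1, limit, other);
}
```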
### Implementation Details
A cond and a mutex are each just a single `int32_t` acting as a futex.
The mutex value carries two bits, `MUTEX_LOCKED` and `MUTEX_SLEEPING`, indicating respectively whether the mutex is locked and whether some thread may be waiting on it.
+ `mutex_trylock()` attempts to lock the mutex; it returns `false` if the mutex is already locked (`MUTEX_LOCKED`), otherwise it locks it and returns `true`.
+ `mutex_lock()` first tries `mutex_trylock()` up to `MUTEX_SPINS` times; if that fails, it calls `futex_wait` to sleep on the mutex, setting `MUTEX_SLEEPING` in the process.
+ `mutex_unlock()` resets the mutex to 0, meaning unlocked. If `MUTEX_SLEEPING` was set, it calls `futex_wake` to wake one thread so it can acquire the lock. Because the woken thread sets `MUTEX_SLEEPING` again while locking, a later unlock will wake the next waiting thread (see the sketch below).
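The locking scheme just described can be sketched as follows, reusing the `futex_wait`/`futex_wake` wrappers above. The bit values, spin count, and memory orderings are assumptions, not the quiz's verbatim code:
```c
#include <stdatomic.h>
#include <stdbool.h>

#define MUTEX_LOCKED 1   /* bit 0: the lock is held */
#define MUTEX_SLEEPING 2 /* bit 1: some thread may be sleeping on the futex */
#define MUTEX_SPINS 128

typedef struct {
    atomic_int state;
} mutex_t;

static bool mutex_trylock(mutex_t *mutex)
{
    /* Set MUTEX_LOCKED; we got the lock only if it was previously clear. */
    return !(atomic_fetch_or_explicit(&mutex->state, MUTEX_LOCKED,
                                      memory_order_acquire) & MUTEX_LOCKED);
}

static void mutex_lock(mutex_t *mutex)
{
    for (int i = 0; i < MUTEX_SPINS; ++i)
        if (mutex_trylock(mutex))
            return;
    /* Contended path: advertise a sleeper, then wait until the word is
     * observed unlocked. */
    int state = atomic_exchange_explicit(&mutex->state,
                                         MUTEX_LOCKED | MUTEX_SLEEPING,
                                         memory_order_acquire);
    while (state & MUTEX_LOCKED) {
        futex_wait(&mutex->state, MUTEX_LOCKED | MUTEX_SLEEPING);
        state = atomic_exchange_explicit(&mutex->state,
                                         MUTEX_LOCKED | MUTEX_SLEEPING,
                                         memory_order_acquire);
    }
}

static void mutex_unlock(mutex_t *mutex)
{
    /* Zero the word (unlock); wake one sleeper if any were advertised. */
    int state = atomic_exchange_explicit(&mutex->state, 0,
                                         memory_order_release);
    if (state & MUTEX_SLEEPING)
        futex_wake(&mutex->state, 1);
}
```
The key point is that a thread entering the contended path unconditionally writes `MUTEX_LOCKED | MUTEX_SLEEPING`, so the lock bit and the sleeper hint are restored together in one atomic exchange.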
The cond value is an integer that is incremented on every signal, marking a change of state.
+ `cond_wait()` first unlocks the associated mutex, then checks up to `COND_SPINS` times whether the cond's state has changed; if so, it simply calls `mutex_lock`. Otherwise it sleeps on the cond, calling `mutex_lock` once woken.
+ `cond_signal()` increments the cond's value and wakes one thread so it can try to acquire the lock.
+ `cond_broadcast()` increments the cond's value and uses `futex_requeue` to wake one thread while moving all the remaining threads onto the mutex's wait queue.
    + Those threads have essentially finished waiting on the cond; all that remains is waiting for the mutex (since `cond_wait` calls `mutex_lock` after waking). Requeuing lets them skip that intermediate step and land directly on the mutex's wait queue, avoiding the context-switch cost of waking them only to have them block again.
    + `cond_wait` sets `MUTEX_SLEEPING` at the end to guarantee that when the woken thread unlocks the mutex, the threads that were requeued onto the mutex's wait queue actually get woken.
The most striking feature of the whole program is this use of futex requeue to accelerate `cond_broadcast`.
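Continuing the same sketch, the condition variable then looks roughly like this (again with plain C11 atomics, and `futex_requeue` from the wrappers above; the spin count and orderings are assumptions):
```c
#include <limits.h>
#include <stdatomic.h>

#define COND_SPINS 128

typedef struct {
    atomic_int seq; /* bumped on every signal/broadcast */
} cond_t;

static void cond_wait(cond_t *cond, mutex_t *mutex)
{
    int seq = atomic_load_explicit(&cond->seq, memory_order_relaxed);
    mutex_unlock(mutex);
    /* Spin briefly in case a signal arrives right away. */
    for (int i = 0; i < COND_SPINS; ++i) {
        if (atomic_load_explicit(&cond->seq, memory_order_relaxed) != seq) {
            mutex_lock(mutex);
            return;
        }
    }
    futex_wait(&cond->seq, seq); /* sleep unless the sequence moved on */
    mutex_lock(mutex);
    /* A requeued thread is moved here without futex_wake ever seeing it,
     * so make sure the next mutex_unlock() issues a wake-up. */
    atomic_fetch_or_explicit(&mutex->state, MUTEX_SLEEPING,
                             memory_order_relaxed);
}

static void cond_signal(cond_t *cond)
{
    atomic_fetch_add_explicit(&cond->seq, 1, memory_order_relaxed);
    futex_wake(&cond->seq, 1);
}

static void cond_broadcast(cond_t *cond, mutex_t *mutex)
{
    atomic_fetch_add_explicit(&cond->seq, 1, memory_order_relaxed);
    /* Wake one waiter; move the rest straight onto the mutex's queue. */
    futex_requeue(&cond->seq, INT_MAX, &mutex->state);
}
```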
### Test Method
Testing uses `main.c`, which maintains `N_NODES` nodes, one thread per node, plus a shared `clock`.
At the start, a single `clock_tick()` advances the tick from $0$ to $1$; call this tick $0$. Let `ctz(i)` denote the number of trailing zeros of `i`; then tick $i$ is emitted by node `ctz(i)`, because every node keeps alternating `node_signal` and `clock_tick`. For example, node 0 emits ticks 1, 3, 5, 7, 9, ..., node 1 emits ticks 2, 6, 10, 14, ..., node 2 emits ticks 4, 12, 20, ..., and so on. Once tick $2^{N\_NODES} - 1$ has been emitted, no node emits any further ticks.
Every tick is accompanied by a broadcast.
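As a quick sanity check of this numbering, a standalone snippet (hypothetical, not part of `main.c`) prints which node emits each tick:
```c
#include <stdio.h>

/* Tick i (i >= 1) is emitted by node ctz(i); tick 0 comes from the main thread. */
int main(void)
{
    for (int i = 1; i <= 16; ++i)
        printf("tick %2d -> node %d\n", i, __builtin_ctz(i));
    return 0;
}
```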
### Linux kernel
The DESCRIPTION -> Priority-inheritance futexes -> FUTEX_WAIT_REQUEUE_PI section of [futex(2)](https://man7.org/linux/man-pages/man2/futex.2.html) notes:
> The FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI were added to support a fairly specific use case: support for priority-inheritance-aware POSIX threads condition variables. The idea is that these operations should always be paired, in order to ensure that user space and the kernel remain in sync. Thus, in the FUTEX_WAIT_REQUEUE_PI operation, the user-space application pre-specifies the target of the requeue that takes place in the FUTEX_CMP_REQUEUE_PI operation.
So `FUTEX_WAIT_REQUEUE_PI` and `FUTEX_CMP_REQUEUE_PI` exist to support priority-inheritance-aware POSIX threads condition variables.
`FUTEX_WAIT_REQUEUE_PI` ties the `wait` and the `requeue` together: the waiter declares up front which futex it will be requeued onto. Q3 below relies on exactly this pairing.
## Q2
First, modify `mutex.h` and `cond.h` so that their interfaces behave like the pthread API:
```c
static inline int mutex_init(mutex_t *mutex, void *_)
{
    atomic_init(&mutex->state, 0);
    return 0;
}
static inline int mutex_lock(mutex_t *mutex)
{
    /* ... original implementation ... */
    return 0;
}
static inline int mutex_unlock(mutex_t *mutex)
{
    /* ... original implementation ... */
    return 0;
}
static inline int cond_init(cond_t *cond, void *_)
{
    atomic_init(&cond->seq, 0);
    return 0;
}
static inline int cond_wait(cond_t *cond, mutex_t *mutex)
{
    /* ... original implementation ... */
    return 0;
}
static inline int cond_signal(cond_t *cond)
{
    /* ... original implementation ... */
    return 0;
}
static inline int cond_broadcast(cond_t *cond, mutex_t *mutex)
{
    /* ... original implementation ... */
    return 0;
}
```
Then add the following shim to the original program:
```c
#include "mutex.h"
#include "cond.h"
#define pthread_mutex_t mutex_t
#define pthread_cond_t cond_t
#define pthread_mutex_lock mutex_lock
#define pthread_mutex_unlock mutex_unlock
#define pthread_cond_signal cond_signal
#define pthread_cond_wait cond_wait
#define pthread_cond_init cond_init
#define pthread_mutex_init mutex_init
#define pthread_mutex_destroy(x) (0)
#define pthread_cond_destroy(x) (0)
```
Compile and run:
```bash
$ cc -std=gnu11 -lpthread -DUSE_LINUX -D_GNU_SOURCE 3.c
$ ./a.out -t -l
1.18 1.17 0.00998
$ ./a.out -t
0.993 1.78 0
$ ./a.out -t -h 8
0.502 2.21 0.0133
$ ./a.out -t -h 12
0.432 2.36 0.017
```
As the results show, the multithreading advantage is preserved.
## Q3
### Test Code
The following program is used for the test:
```c
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include "cond.h"
#include "mutex.h"

mutex_t mut, mut_res;
volatile int ok, ok2; /* busy-waited flags, so they must be volatile */
volatile int counter;
int result[3], result_it;
#define SPINS 500000000

void *thread_func(void *arg)
{
    int thread_id = (int) (intptr_t) arg;
    if (thread_id == 0) {
        mutex_lock(&mut); /* the low-priority thread takes the lock first */
        ok2 = 1;
        while (!ok) {}
    }
    if (thread_id == 1) {
        while (!ok) {}
    }
    if (thread_id == 2) {
        while (!ok) {}
        mutex_lock(&mut); /* the high-priority thread blocks on the held lock */
    }
    for (int i = 0; i < SPINS; i++)
        counter++;
    mutex_lock(&mut_res);
    result[thread_id] = result_it++; /* record finishing order */
    mutex_unlock(&mut_res);
    if (thread_id == 0 || thread_id == 2)
        mutex_unlock(&mut);
    return NULL;
}

int test(void)
{
    pthread_t threads[3];
    int rc;
    pthread_attr_t attr;
    struct sched_param param;
    rc = pthread_attr_init(&attr);
    if (rc)
        return 1;
    rc = pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
    if (rc)
        return 1;
    /* Without PTHREAD_EXPLICIT_SCHED, pthread_create() silently ignores the
     * policy and priority set in attr; running this requires privileges. */
    rc = pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
    if (rc)
        return 1;
    for (int i = 0; i < 3; i++) {
        rc = pthread_attr_getschedparam(&attr, &param);
        if (rc)
            return 1;
        param.sched_priority = i + sched_get_priority_min(SCHED_FIFO);
        rc = pthread_attr_setschedparam(&attr, &param);
        if (rc)
            return 1;
        rc = pthread_create(&threads[i], &attr, thread_func,
                            (void *) (intptr_t) i);
        if (rc)
            return 1;
    }
    while (!ok2) {}
    ok = 1;
    for (int i = 0; i < 3; i++)
        pthread_join(threads[i], NULL);
    pthread_attr_destroy(&attr);
    return 0;
}

#define TEST_CNT 50
int main(void)
{
    mutex_init(&mut, NULL);
    mutex_init(&mut_res, NULL);
    int pi_cnt = 0, cnt[3] = {0};
    for (int i = 0; i < TEST_CNT; i++) {
        ok = ok2 = result_it = 0;
        test();
        cnt[result[1]]++;
        if (result[1] < result[2])
            pi_cnt++;
    }
    printf("The count with respect to the order of the finish of threads (in thread id):\n"
           "1->0->2: %d\n"
           "0->1->2: %d\n"
           "0->2->1: %d\n",
           cnt[0], cnt[1], cnt[2]);
    printf("%d out of %d priority inversions happened\n", pi_cnt, TEST_CNT);
}
```
This program reproduces the classic priority inversion scenario: the high-priority thread waits on a mutex held by the low-priority thread, which lets the medium-priority thread finish first.
The scheduling policy is `SCHED_FIFO`, and each thread performs a large amount of CPU work so that scheduling effects reliably show up.
Compile and run it against the original implementation:
```bash
$ gcc -std=c11 -lpthread -DUSE_LINUX -D_GNU_SOURCE pi.c -o pi
$ ./pi
The count with respect to the order of the finish of threads (in thread id):
1->0->2: 24
0->1->2: 26
0->2->1: 0
50 out of 50 priority inversions happened
```
A large number of priority inversions is observed.
### Implementing Priority Inheritance
First, extend the `futex.h` interface:
```c
/* Try to acquire the PI futex without blocking. */
static inline void futex_trylock_pi(atomic int *futex)
{
    syscall(SYS_futex, futex, FUTEX_TRYLOCK_PI_PRIVATE);
}
/* Acquire the PI futex; while we block, the kernel boosts the owner's
 * priority to ours. */
static inline void futex_lock_pi(atomic int *futex)
{
    syscall(SYS_futex, futex, FUTEX_LOCK_PI_PRIVATE);
}
/* Release the PI futex, handing ownership to the top waiter. */
static inline void futex_unlock_pi(atomic int *futex)
{
    syscall(SYS_futex, futex, FUTEX_UNLOCK_PI_PRIVATE);
}
/* Wake one waiter, and requeue up to 'limit' waiters onto the PI futex
 * 'other', provided *futex still equals 'value'. */
static inline void futex_cmp_requeue_pi(atomic int *futex, int limit, int value, atomic int *other)
{
    syscall(SYS_futex, futex, FUTEX_CMP_REQUEUE_PI_PRIVATE, 1, limit, other, value);
}
/* Wait on 'futex' expecting 'value', pre-declaring that we will be
 * requeued onto the PI futex 'other'. */
static inline void futex_wait_requeue_pi(atomic int *futex, int value, atomic int *other)
{
    syscall(SYS_futex, futex, FUTEX_WAIT_REQUEUE_PI_PRIVATE, value, NULL, other);
}
```
Next, modify `mutex.h` and `cond.h` into `mutex2.h` and `cond2.h`. Note that PI futexes follow a kernel-defined value protocol: 0 means unlocked, otherwise the value holds the owner's TID, and the kernel sets the `FUTEX_WAITERS` bit while threads are blocked; this is why `mutex2.h` below works with `gettid()` and `FUTEX_WAITERS`.
```c
static inline int cond_wait(cond_t *cond, mutex_t *mutex)
{
    int seq = load(&cond->seq, relaxed);
    mutex_unlock(mutex);
#define COND_SPINS 128
    for (int i = 0; i < COND_SPINS; ++i) {
        if (load(&cond->seq, relaxed) != seq) {
            mutex_lock(mutex);
            return 0;
        }
        spin_hint();
    }
    futex_wait_requeue_pi(&cond->seq, seq, &mutex->state);
    mutex_lock(mutex);
    // fetch_or(&mutex->state, AAAA, relaxed);
    fetch_or(&mutex->state, MUTEX_SLEEPING, relaxed);
    return 0;
}
static inline int cond_signal(cond_t *cond, mutex_t *mutex)
{
    // fetch_add(&cond->seq, BBBB, relaxed);
    fetch_add(&cond->seq, 1, relaxed);
    // EEEE(&cond->seq, 1);
    futex_cmp_requeue_pi(&cond->seq, 0, cond->seq, &mutex->state);
    return 0;
}
static inline int cond_broadcast(cond_t *cond, mutex_t *mutex)
{
    // fetch_add(&cond->seq, CCCC, relaxed);
    fetch_add(&cond->seq, 1, relaxed);
    // futex_requeue(&cond->seq, DDDD, &mutex->state);
    futex_cmp_requeue_pi(&cond->seq, INT_MAX, cond->seq, &mutex->state);
    return 0;
}
```

```c
static bool mutex_trylock(mutex_t *mutex)
{
    int tmp = 0, tid = gettid();
    if (compare_exchange_strong(&mutex->state, &tmp, tid, relaxed, relaxed))
        return true;
    futex_trylock_pi(&mutex->state);
    tmp = load(&mutex->state, relaxed);
    thread_fence(&mutex->state, acquire);
    /* We own the lock iff the TID field now holds our TID. */
    return (tmp & FUTEX_TID_MASK) == tid;
}
static inline int mutex_lock(mutex_t *mutex)
{
    futex_lock_pi(&mutex->state);
    /*
    #define MUTEX_SPINS 128
    for (int i = 0; i < MUTEX_SPINS; ++i) {
        if (mutex_trylock(mutex))
            return 0;
        spin_hint();
    }
    futex_lock_pi(&mutex->state);
    */
    thread_fence(&mutex->state, acquire);
    return 0;
}
static inline int mutex_unlock(mutex_t *mutex)
{
    futex_unlock_pi(&mutex->state);
    /*
    if (load(&mutex->state, relaxed) & FUTEX_WAITERS)
        futex_unlock_pi(&mutex->state);
    else
        exchange(&mutex->state, 0, release);
    */
    return 0;
}
```
Finally, extend `main.c`:
```c
#ifdef USE_PI
#include "cond2.h"
#include "mutex2.h"
#else
#include "cond.h"
#include "mutex.h"
#endif
```
#### Failed Attempt
Priority inversion still cannot be prevented at this point; the root cause remains to be determined.
Reference:
+ [Futex Requeue PI](https://docs.kernel.org/locking/futex-requeue-pi.html)
## Q4