# linux2023 HW2: vax-r
> [GitHub](https://github.com/vax-r/linux2023-HW2)
## Survey
The test builds a graph that is in fact a singly linked list.
Each node is handled by its own thread.
A node has a mutex lock, a condition variable, and a clock,
and the clock has a mutex lock, a condition variable, and a tick counter (`ticks`).
Note that every node in this graph shares the same clock.
The main thread spawns `N_NODES` child threads, each of which does the following:
it runs a loop whose index serves as its tick count;
if the node's clock ticks fall below the current loop index,
the thread leaves the loop and signals the node.
:::warning
TODO: think about which real-world scenarios this kind of test corresponds to.
:notes: jserv
:::
The main thread itself then performs one `clock_tick`, a `clock_wait`, and finally `clock_stop`.
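A minimal sketch of the data layout described above, using the homework's `mutex_t`/`cond_t` types; field names such as `ready` and the "negative ticks means stopped" convention are my assumptions, not necessarily the actual code:

```c
struct clock {
    mutex_t mutex;       /* protects ticks */
    cond_t cond;         /* signaled whenever ticks changes or the clock stops */
    int ticks;           /* shared tick counter (assumption: negative = stopped) */
};

struct node {
    struct clock *clock; /* the one clock shared by every node */
    struct node *parent; /* previous node in the list, NULL for node[0] */
    mutex_t mutex;       /* protects ready */
    cond_t cond;         /* signaled by node_signal */
    bool ready;          /* set once this node has been signaled */
};
```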
### futex
Reading [futex(7)](https://man7.org/linux/man-pages/man7/futex.7.html)
and [A futex overview and update](https://lwn.net/Articles/360699/)
brings out several key points about futexes:
* A futex is a piece of memory shared between processes; it need not be mapped at the same address in every process.
* A futex is an aligned integer which is touched only by atomic assembler instructions.
* Futex operation occurs entirely in user space for the noncontended case.
The futex system call looks like this:
```clike
long syscall(SYS_futex, uint32_t *uaddr, int futex_op, uint32_t val,
             const struct timespec *timeout, /* or: uint32_t val2 */
             uint32_t *uaddr2, uint32_t val3);
```
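For reference, a minimal sketch of thin wrappers around the raw system call; the `futex_wait`/`futex_wake` helpers used later in this note presumably wrap the same two operations, though their exact signatures may differ (the `_PRIVATE` variants would also work for purely intra-process futexes):

```c
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Sleep as long as *uaddr still equals val; the kernel re-checks this
 * atomically before putting the caller on the wait queue. */
static long futex_wait(uint32_t *uaddr, uint32_t val)
{
    return syscall(SYS_futex, uaddr, FUTEX_WAIT, val, NULL, NULL, 0);
}

/* Wake at most n waiters currently sleeping on uaddr. */
static long futex_wake(uint32_t *uaddr, uint32_t n)
{
    return syscall(SYS_futex, uaddr, FUTEX_WAKE, n, NULL, NULL, 0);
}
```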
Waking up a futex splits into two cases:
**uncontended**
Use assembler instructions that cause the host CPU to atomically increment the integer, then check whether it has in fact changed from 0 to 1, in which case there were no waiters and the operation is done.
**contended**
The atomic increment changed the counter from -1 (or some other negative number). If this is detected, there are waiters. User space should now set the counter to 1 and instruct the kernel to wake up any waiters using the FUTEX_WAKE operation.
:::info
**FUTEX_WAKE**
This operation wakes at most `val` of the waiters waiting on `uaddr`.
`val` is usually set to 1 (wake a single waiter) or `INT_MAX` (wake all waiters).
There is no guarantee about which waiter gets woken.
:::
Waiting on a futex also splits into two cases:
**uncontended**
Atomically decrement the counter and check whether it changed to 0, in which case the operation is done.
**contended**
The process should set the counter to -1 and request that the kernel wait for another process to up the futex. This is done using the FUTEX_WAIT operation.
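A sketch of how these pieces combine into a lock. Instead of the increment/decrement counter protocol quoted above, this uses the 0/1/2 encoding from Ulrich Drepper's "Futexes Are Tricky" (0 = unlocked, 1 = locked, 2 = locked and possibly contended), but it illustrates the same idea: the uncontended path never enters the kernel.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Reuses the futex_wait()/futex_wake() wrappers sketched earlier. */

static void futex_mutex_lock(atomic_int *word)
{
    int expected = 0;
    /* Fast path: 0 -> 1 with a single CAS, no system call. */
    if (atomic_compare_exchange_strong(word, &expected, 1))
        return;

    /* Slow path: mark the lock contended (2). Whoever observes the old
     * value 0 owns the lock; everyone else sleeps in the kernel. */
    while (atomic_exchange(word, 2) != 0)
        futex_wait((uint32_t *) word, 2); /* sleep only while *word == 2 */
}

static void futex_mutex_unlock(atomic_int *word)
{
    /* If the old value was 1, no waiter ever announced itself. */
    if (atomic_exchange(word, 0) == 2)
        futex_wake((uint32_t *) word, 1); /* wake exactly one waiter */
}
```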
### futex requeue PI
PI stands for priority inheritance.
A traditional `pthread_cond_broadcast` wakes every thread sleeping on the condition variable,
and all of them then compete for the lock at once.
With futexes the goal is to decide up front which waiter thread has the highest priority
and wake only that one.
### Why is futex fast?
A futex improves performance in the following ways.
A futex is essentially an atomic variable maintained in user space plus a wait queue kept in kernel space.
When there is no contention, lock and unlock are carried out entirely in user space,
which saves the cost and time of switching into kernel space and therefore improves performance.
Only under contention does a system call enter kernel space.
And instead of letting all threads compete for the lock the way a plain mutex does,
a futex lets the kernel use the scheduler or priorities to choose which thread to wake.
## Basic problems
`AAAA`:`fetch_or(&mutex->state, MUTEX_SLEEPING, relaxed);`
`BBBB`:`fetch_add(&cond->seq, 1, relaxed);`
`CCCC`:`fetch_add(&cond->seq, 1, relaxed);`
`DDDD`:`futex_requeue(&cond->seq, 1, &mutex->state);`
`EEEE`:`futex_wake(&cond->seq, 1);`
`FFFF`:`futex_wake(&mutex->state, 1);`
## Explain the code
### Working flow
From the code in `main` we can infer that, after the clock and all nodes are initialized,
threads are created to handle the nodes.
However, each thread first hits the `clock_wait` in the loop condition of `thread_func`.
At that point `clock->ticks` is 0, so every thread goes to sleep on `clock->cond`
until the main thread performs the first `clock_tick`,
which bumps `clock->ticks` to 1 and wakes one waiter.
After that the main thread sits in `clock_wait` until all nodes have been processed.
The tick each node passes to `clock_wait` is its own loop index,
and `clock->ticks` grows so fast that the loop index quickly falls behind,
so, except at the very beginning, the condition of the while loop inside `clock_wait` is almost never true.
Also, because node[0] has no parent, its state is always ready,
while every other node has to wait for its parent inside `node_wait`.
We can therefore infer that node[0] is the node that performs `clock_tick` the most times.
These steps keep repeating until `clock->ticks` reaches `1 << N_NODES`.
To verify that my reasoning is correct,
I ran the following experiment to count how many ticks each node actually contributes to the clock,
and got an interesting result:
```
node 0 ticks for 32768 times
node 1 ticks for 16384 times
node 2 ticks for 8192 times
node 3 ticks for 4096 times
node 4 ticks for 2048 times
node 5 ticks for 1024 times
node 6 ticks for 512 times
node 7 ticks for 256 times
node 8 ticks for 128 times
node 9 ticks for 64 times
node 10 ticks for 32 times
node 11 ticks for 16 times
node 12 ticks for 8 times
node 13 ticks for 4 times
node 14 ticks for 2 times
node 15 ticks for 1 times
```
The result above can actually be deduced from `thread_func`.
Think of each node's advancing loop index as a private clock that the node keeps for itself,
called the private clock below,
and call the shared clock the public clock.
### thread_func
If node[i] is not ready, then no node[j] with j > i can make progress,
so the larger a node's index, the harder it is for it to run, because it must wait for all of its ancestors to become ready.
Each time the private clock advances by 1, `bit` flips once,
and `bit` decides whether the node performs `node_signal` or `clock_tick`,
so the faster a node flips its bit, the more times it can execute `clock_tick`;
in other words, the faster a node's private clock runs, the more ticks it contributes.
The pace of the private clock is limited by `clock_wait`, i.e. the comparison between the public clock and the private clock.
Since node[0] has no ancestors and a private clock rarely gets ahead of the public clock,
assume node[0] keeps flipping its bit, alternating between `node_signal` and `clock_tick`.
Only half of node[0]'s iterations perform the `node_signal` that wakes node[1],
so node[1] flips its bit at half the rate of node[0],
and therefore node[1] executes `clock_tick` only half as many times as node[0].
By the same argument, node[i] executes `clock_tick` twice as many times as node[i+1].
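For reference, here is a sketch of the unmodified `thread_func` that the analysis above assumes; it is reconstructed from the modified version shown below, so details may differ slightly from the actual homework code:

```c
static void *thread_func(void *ptr)
{
    struct node *self = ptr;
    bool bit = false;

    /* The loop index i is this node's "private clock". */
    for (int i = 1; clock_wait(self->clock, i); ++i) {
        if (self->parent)
            node_wait(self->parent); /* wait until the parent has run */

        if (bit)
            node_signal(self);       /* every other iteration: wake the child */
        else
            clock_tick(self->clock); /* otherwise: advance the shared clock */
        bit = !bit;
    }

    node_signal(self); /* let the child observe the end as well */
    return NULL;
}
```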
If you don't believe it, we can replace `bit` with an integer
and drive it so that node[i] contributes four times as many public clock ticks as node[i+1]:
```clike
static void *thread_func(void *ptr)
{
    struct node *self = ptr;
    // bool bit = false;
    int bit = 1;

    for (int i = 1; clock_wait(self->clock, i); ++i) {
        if (self->parent)
            node_wait(self->parent);

        /* Signal the child only every fourth iteration ... */
        if (bit % 4 == 0) {
            node_signal(self);
        } else {
            /* ... otherwise advance the shared clock and count it. */
            clock_tick(self->clock);
            mutex_lock(&self->mutex);
            self->ticktime++;
            mutex_unlock(&self->mutex);
        }
        bit++;
    }

    node_signal(self);
    return NULL;
}
```
The result is as follows; each node indeed contributes four times as many ticks as the next one:
```
node 0 ticks for 49152 times
node 1 ticks for 12288 times
node 2 ticks for 3072 times
node 3 ticks for 768 times
node 4 ticks for 192 times
node 5 ticks for 48 times
node 6 ticks for 12 times
node 7 ticks for 3 times
node 8 ticks for 0 times
node 9 ticks for 0 times
node 10 ticks for 0 times
node 11 ticks for 0 times
node 12 ticks for 0 times
node 13 ticks for 0 times
node 14 ticks for 0 times
node 15 ticks for 0 times
```
### node_wait
Mainly used to wait until the parent node has finished its work, as sketched below.
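The note does not show `node_wait` itself, but from the description it presumably looks roughly like this sketch (the `ready` field name is an assumption):

```c
static void node_wait(struct node *node)
{
    mutex_lock(&node->mutex);
    while (!node->ready)                      /* sleep until node_signal() marks us ready */
        cond_wait(&node->cond, &node->mutex);
    node->ready = false;                      /* consume the signal */
    mutex_unlock(&node->mutex);
}
```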
### clock_wait
Used to wait until `clock->ticks` is greater than or equal to the caller's loop index,
or until the clock has been stopped.
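Likewise, a sketch of what `clock_wait` presumably does, assuming `clock_stop` marks the clock with a negative tick count (the actual stop convention in the homework may differ):

```c
static bool clock_wait(struct clock *clock, int ticks)
{
    mutex_lock(&clock->mutex);
    /* Sleep while the clock is still running but has not reached `ticks` yet. */
    while (clock->ticks >= 0 && clock->ticks < ticks)
        cond_wait(&clock->cond, &clock->mutex);
    bool keep_running = clock->ticks >= ticks; /* false once the clock stopped */
    mutex_unlock(&clock->mutex);
    return keep_running;
}
```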
### mutex
#### mutex_trylock
```clike
static bool mutex_trylock(mutex_t *mutex)
{
    int state = load(&mutex->state, relaxed);
    if (state & MUTEX_LOCKED)
        return false;

    state = fetch_or(&mutex->state, MUTEX_LOCKED, relaxed);
    if (state & MUTEX_LOCKED)
        return false;

    thread_fence(&mutex->state, acquire);
    return true;
}
```
Check whether `mutex` is already locked; if so, return false.
Otherwise issue a `thread_fence` and return true.
The initial `load` may not be necessary, because `fetch_or` already returns the old value of the mutex while attempting to take the lock; a simplified variant is sketched below.
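A sketch of that simplification (untested; the early relaxed load may still be worth keeping as a cheap way to skip the read-modify-write when the lock is obviously held):

```c
static bool mutex_trylock(mutex_t *mutex)
{
    /* fetch_or both attempts the lock and hands back the previous state. */
    int state = fetch_or(&mutex->state, MUTEX_LOCKED, relaxed);
    if (state & MUTEX_LOCKED)
        return false;
    thread_fence(&mutex->state, acquire);
    return true;
}
```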
:::warning
I do not understand `thread_fence` yet; I should study the relevant material.
:::
#### mutex_lock
```clike
static inline void mutex_lock(mutex_t *mutex)
{
#define MUTEX_SPINS 128
    for (int i = 0; i < MUTEX_SPINS; ++i) {
        if (mutex_trylock(mutex))
            return;
        spin_hint();
    }

    int state = exchange(&mutex->state, MUTEX_LOCKED | MUTEX_SLEEPING, relaxed);
    while (state & MUTEX_LOCKED) {
        futex_wait(&mutex->state, MUTEX_LOCKED | MUTEX_SLEEPING);
        state = exchange(&mutex->state, MUTEX_LOCKED | MUTEX_SLEEPING, relaxed);
    }
    thread_fence(&mutex->state, acquire);
}
```
If `mutex_trylock` succeeds, return immediately.
If the lock still has not been acquired after spinning, fall out of the loop and perform an atomic exchange.
If `state & MUTEX_LOCKED` is set in the returned old value, someone else still holds the lock,
so the thread calls `futex_wait` and repeats until the lock is released; the exchange whose old value has `MUTEX_LOCKED` cleared is the one that acquires the lock.
:::warning
Why is `MUTEX_LOCKED | MUTEX_SLEEPING` used as the value compared against the futex word?
Wouldn't plain 1 do?
The difference deserves further investigation.
:::
#### mutex_unlock
```clike
static inline void mutex_unlock(mutex_t *mutex)
{
    int state = exchange(&mutex->state, 0, release);
    if (state & MUTEX_SLEEPING)
        futex_wake(&mutex->state, 1);
}
```
After grabbing the old lock state, check whether any thread is sleeping on the lock;
if so, have `futex_wake` wake up exactly one thread.
Unlike a traditional mutex_unlock, which may wake all threads and let them all compete for the lock again,
here the old state tells user space whether there was contention,
and only then does user space ask the kernel to pick a single waiter to wake.
:::warning
How is that waiter chosen? PI does not seem to be used here.
:::
### condition variable
#### cond_wait
```clike
static inline void cond_wait(cond_t *cond, mutex_t *mutex)
{
    int seq = load(&cond->seq, relaxed);
    mutex_unlock(mutex);

#define COND_SPINS 128
    for (int i = 0; i < COND_SPINS; ++i) {
        if (load(&cond->seq, relaxed) != seq) {
            mutex_lock(mutex);
            return;
        }
        spin_hint();
    }

    futex_wait(&cond->seq, seq);
    mutex_lock(mutex);
    fetch_or(&mutex->state, MUTEX_SLEEPING, relaxed);
}
```
First read the current sequence number of the condition variable, then release the lock.
If `cond->seq` later differs from the value just read, some thread has probably performed a signal or broadcast,
and if this thread happens to notice that during the spin phase, it simply re-acquires the lock and returns.
To avoid burning too many atomic instructions, after enough spins it gives up and enters `futex_wait`.
Being woken up means it is this thread's turn, so it re-acquires the lock.
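A small usage sketch of this mutex/condition-variable pair (not from the homework; `done` and the function names are illustrative), showing the usual pattern of re-checking the predicate in a loop after every wakeup:

```c
static mutex_t m;
static cond_t cv;
static bool done = false;

static void waiter(void)
{
    mutex_lock(&m);
    while (!done)             /* spurious or stolen wakeups are possible */
        cond_wait(&cv, &m);
    mutex_unlock(&m);
}

static void finisher(void)
{
    mutex_lock(&m);
    done = true;
    mutex_unlock(&m);
    cond_signal(&cv, &m);     /* wake at most one waiter on cv */
}
```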
#### cond_signal
```clike
static inline void cond_signal(cond_t *cond, mutex_t *mutex)
{
    fetch_add(&cond->seq, 1, relaxed);
    futex_wake(&cond->seq, 1);
}
```
Bump `seq` and use the futex to wake up one waiter.
#### cond_broadcast
```clike
static inline void cond_broadcast(cond_t *cond, mutex_t *mutex)
{
    fetch_add(&cond->seq, 1, relaxed);
    futex_requeue(&cond->seq, 1, &mutex->state);
}
```
Use `futex_requeue` to pick one waiter out of all of them and wake it up,
then move the remaining waiters onto the `mutex` so they keep sleeping there instead of stampeding for the lock.
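The `futex_requeue` helper is presumably a thin wrapper over the kernel's `FUTEX_REQUEUE` operation, roughly like the sketch below: wake up to `limit` waiters on the first futex word and requeue the rest onto the second one (here, from `cond->seq` onto `mutex->state`). The actual wrapper's signature may differ.

```c
#include <limits.h>

static long futex_requeue(uint32_t *uaddr, uint32_t limit, uint32_t *uaddr2)
{
    /* Wake at most `limit` waiters on uaddr, move the rest to uaddr2. */
    return syscall(SYS_futex, uaddr, FUTEX_REQUEUE, limit,
                   (long) INT_MAX, uaddr2, 0);
}
```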
### Strengthening the futex mechanism for pthreads
[PI-futex](https://www.kernel.org/doc/Documentation/pi-futex.txt)
When contention happens in user space,
we cannot make a task non-preemptible inside its critical section by disabling interrupts or by using a spinlock,
so priority inheritance is used to strengthen the mechanism instead.
On the fast path, a PI-enabled pthread mutex needs no kernel-space involvement at all.
> 'robustness' and 'PI' are two orthogonal properties of futexes, and all four combinations are possible: futex, robust-futex, PI-futex, robust+PI-futex.
## qsort with futex
Apply this lightweight pthread implementation to the qsort from homework 1.
First I had to set up an experiment environment: my machine runs macOS, which has no futex and lacks some Linux commands,
so I added a Docker container and ran the program inside it.
Since the code is somewhat long, the corresponding commit is linked below:
[commit](https://github.com/vax-r/linux2023-HW2/commit/a8b1922c82c7bfc2ac6166753c6ad795a1c48b74)
## Implementing priority inheritance
### priority inversion
See [Introduction to RTOS - Solution to Part 11 (Priority Inversion)](https://www.digikey.com/en/maker/projects/introduction-to-rtos-solution-to-part-11-priority-inversion/abf4b8f7cd4a4c70bece35678d178321).
Priority inversion occurs when a task with higher priority is preempted by a task with lower priority.
> Priority inversion is a bug that occurs when a high priority task is indirectly preempted by a low priority task.
Note that the preemption here is **indirect**.
After all, no scheduling algorithm would let a low-priority task directly preempt a high-priority one; otherwise what would be the point of assigning priorities?
Usually the high-priority task is forced to give up the CPU for reasons beyond its control.
For example, the high-priority task requests a mutex lock that happens to be held by a low-priority task;
at that point the high-priority task has no choice but to hand over the CPU.
Priority inversion comes in two flavors, bounded and unbounded.
* **bounded priority inversion**

  In the figure there are two tasks, task H and task L.
  As the names suggest, task H has the higher priority.
  Because the lock task H wants is held by task L, task H has to yield the CPU first,
  but task L is guaranteed to release the lock after some finite amount of time, at which point task H can continue,
  so the waiting time is bounded.
* **unbounded priority inversion**

  Now suppose there is a third task, task M, whose priority is higher than task L's but lower than task H's.
  When task H yields the CPU to task L because of the lock,
  task M, having higher priority than task L, can preempt task L and take the CPU.
  In that situation there is no telling when task M will hand the CPU back to task L,
  and therefore no telling when task L will finally release the lock to task H,
  so the waiting time is unbounded.
:::info
Priority inversion once caused a malfunction in a machine sent to Mars:
[What really happened on Mars Rover Pathfinder](http://www.cs.cornell.edu/courses/cs614/1999sp/papers/pathfinder.html)
:::
There are two main ways to solve unbounded priority inversion:
the priority ceiling protocol and priority inheritance.
### priority inheritance
Raise task L's priority so that task L can no longer be preempted by task M,
thereby avoiding unbounded priority inversion.

As the figure shows, when priority inversion occurs and the CPU returns to task L,
task L's priority is raised to match task H's.
That way task M has no way to preempt task L, so unbounded priority inversion cannot happen.
:::warning
Neither priority inheritance nor the priority ceiling protocol
solves anything more than unbounded priority inversion.
Bounded priority inversion is hard to avoid; the more realistic mitigations are:
* Keep the work inside critical sections as short as possible, so that any bounded priority inversion is short as well.
* Avoid locks that can block a high-priority task.
* Use one task to control a shared resource, to avoid the need to create locks to protect it (I am not sure what kind of task this refers to).
:::
### Understand Linux Kernel PI-mutex
First, understand how the Linux kernel implements a mutex capable of priority inheritance:
[RT-mutex subsystem with PI support](https://docs.kernel.org/locking/rt-mutex.html)
The Linux kernel builds PI-futexes on top of an RT-mutex with priority inheritance,
which is also what gives `pthread_mutex_t` a priority-inheritance attribute.
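From user space that attribute is requested through the standard pthreads API; a minimal example (this is the stock glibc/POSIX interface, independent of the homework code):

```c
#include <pthread.h>

static pthread_mutex_t pi_mutex;

static void init_pi_mutex(void)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    /* Ask for a priority-inheritance mutex, backed by the kernel's PI-futex. */
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    pthread_mutex_init(&pi_mutex, &attr);
    pthread_mutexattr_destroy(&attr);
}
```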
The document notes:
> If the temporarily boosted owner blocks on a rt-mutex itself it propagates the priority boosting to the owner of the other rt_mutex it gets blocked on.
In other words, a high-priority waiter **propagates** its priority boost to the owner.
Moreover, an rt-mutex has a waiter tree, and enqueueing into that tree must follow priority order;
waiters of equal priority are handled FIFO.
Only the top-priority waiter is enqueued into the owner's priority waiter tree.
Whenever the top-priority waiter changes, the owner's priority is changed accordingly.
The priority enqueueing is handled through `pi_waiters`.
Next, look at how the RT-mutex is implemented:
[RT-mutex implementation design](https://docs.kernel.org/locking/rt-mutex-design.html)
Start with a few terms:
* **PI chain**
  An ordered series of locks and processes that cause processes to inherit priorities from a previous process that is blocked on one of its locks.
```
E->L4->D->L3->C-+
                +->L2-+
                |     |
              G-+     +->B->L1->A
                      |
                F->L5-+
```
  A PI chain looks like the diagram above; `E->L4` means E is trying to acquire L4, and so on down the chain.
  Note that a PI chain never diverges, it only converges.
  For PI to work, the processes toward the right of the chain must end up with a priority greater than or equal to the processes on their left.
  For example, if G has the highest priority, then the priorities of both A and B have to be boosted to match G's.
* **mutex**
  In the material below, to distinguish the locks that implement PI from the spin locks used inside the PI code, the PI locks are called mutexes.
* **lock**
  The word lock refers to the spin locks that protect parts of the PI algorithm.
* **waiter**
  A structure stored on the stack of a process blocking on the mutex; since the process stays blocked for the waiter's whole lifetime, it is safe to keep it on the process's stack.
  The structure holds a pointer to the task and to the mutex that the task is blocked on.
  It also embeds **rbtree node structures**,
  so it can sit both in the mutex's waiter rbtree and in the `pi_waiters` rbtree of the mutex owner task.
  *task and process are simply two names used to tell the two processes involved apart.*
* **mutex waiters tree**
  The mutex stores the waiters blocked on it in an rbtree,
  protected by a spin lock called `wait_lock`.
* **task PI tree**
  Every process has its own PI rbtree.
  This is a tree of all the top waiters of the mutexes that are owned by the process.
  If the process is to inherit a new priority, it is always the priority of the task at the top of this PI tree.
  The tree is called `pi_waiters` and is protected by a spin lock called `pi_lock`;
  note that interrupts must be disabled while `pi_lock` is held.
* **mutex owner and flags**
  The mutex struct keeps a pointer to its owner.
  Because the task struct is at least 2-byte aligned, the "has waiters" bit can be stuffed into the least significant bit of that pointer, as sketched after this list.
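A user-space sketch of that trick (illustrative only, not the kernel's actual code): because the pointer is at least 2-byte aligned, bit 0 is always free to carry the "has waiters" flag.

```c
#include <stdbool.h>
#include <stdint.h>

struct task_struct;                 /* stand-in for the kernel's task struct */

#define HAS_WAITERS 1UL

/* Pack the flag into bit 0 of the (aligned) owner pointer. */
static inline uintptr_t owner_encode(struct task_struct *owner, bool has_waiters)
{
    return (uintptr_t) owner | (has_waiters ? HAS_WAITERS : 0);
}

/* Strip the flag to recover the real pointer. */
static inline struct task_struct *owner_decode(uintptr_t word)
{
    return (struct task_struct *) (word & ~HAS_WAITERS);
}

static inline bool owner_has_waiters(uintptr_t word)
{
    return word & HAS_WAITERS;
}
```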
See also: [Futex Requeue PI](https://docs.kernel.org/locking/futex-requeue-pi.html)
### Reproducing priority inversion
Before implementing the PI-mutex, first construct a priority-inversion scenario and then try to solve it.
Note that this experiment must be run with root privileges,
and the `taskset` command is used to pin all the threads onto the same CPU.
For pthreads, a lower priority number means a lower priority by default.
```c
#include <unistd.h>
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

static pthread_mutex_t lock1, lock2;

void *task1(void *arg)
{
    pthread_mutex_lock(&lock1);
    sleep(1);
    int t;
    struct sched_param param;
    pthread_t id = *(pthread_t *) arg;
    pthread_getschedparam(id, &t, &param);
    printf("task1 priority: %d\n", param.sched_priority);
    pthread_mutex_unlock(&lock1);
    return NULL;
}

void *task2(void *arg)
{
    sleep(1);
    int t;
    struct sched_param param;
    pthread_t id = *(pthread_t *) arg;
    pthread_getschedparam(id, &t, &param);
    printf("task2 priority: %d\n", param.sched_priority);
    return NULL;
}

void *task3(void *arg)
{
    pthread_mutex_lock(&lock1);
    // sleep(1);
    int t;
    struct sched_param param;
    pthread_t id = *(pthread_t *) arg;
    pthread_getschedparam(id, &t, &param);
    printf("task3 priority: %d\n", param.sched_priority);
    pthread_mutex_unlock(&lock1);
    return NULL;
}

#define THREAD_COUNT 3

int main()
{
    pthread_t ids[THREAD_COUNT];
    pthread_attr_t attr[THREAD_COUNT];
    struct sched_param param[THREAD_COUNT];
    void *result;
    int t1 = 0, status;

    printf("priority(min-max): %d-%d\n", sched_get_priority_min(t1),
           sched_get_priority_max(t1));

    for (int i = 0; i < THREAD_COUNT; i++) {
        if (pthread_attr_init(&attr[i])) {
            printf("Error when initializing attr %d\n", i);
            exit(1);
        }
    }

    /* Explicitly use SCHED_RR with increasing priorities: task1 < task2 < task3 */
    for (int i = 0; i < THREAD_COUNT; i++) {
        pthread_attr_getschedparam(&attr[i], &param[i]);
        pthread_attr_setschedpolicy(&attr[i], SCHED_RR);
        pthread_attr_setinheritsched(&attr[i], PTHREAD_EXPLICIT_SCHED);
        param[i].sched_priority = (i + 1) * 15 + 1;
        pthread_attr_setschedparam(&attr[i], &param[i]);
    }

    pthread_mutex_init(&lock1, NULL);
    pthread_mutex_init(&lock2, NULL);

    pthread_create(&ids[0], &attr[0], task1, (void *) &ids[0]);
    pthread_create(&ids[1], &attr[1], task2, (void *) &ids[1]);
    pthread_create(&ids[2], &attr[2], task3, (void *) &ids[2]);

    for (int i = 0; i < THREAD_COUNT; i++)
        pthread_join(ids[i], &result);
    for (int i = 0; i < THREAD_COUNT; i++)
        pthread_attr_destroy(&attr[i]);
    return 0;
}
```
```bash
$ sudo gcc -O2 -o pi pthread_exp.c -lpthread
$ sudo taskset 0x1 ./pi
task2 priority: 31
task1 priority: 16
task3 priority: 46
```
### Designing and implementing a PI-mutex
:::info
The code lives here: [Github PI-implement](https://github.com/vax-r/linux2023-HW2/tree/master/PI-implement)
:::
Following the Linux kernel's PI-mutex implementation,
I added two new members to the original `mutex_t` structure:
`owner` is the thread that currently owns the mutex lock,
and `wait_lock` protects `owner`.
```c
typedef struct {
    atomic int state;
    pthread_t *owner;
    spinlock_t wait_lock;
} mutex_t;
```
:::warning
The Linux kernel additionally maintains a red-black tree inside the mutex,
but in my case at most two threads ever contend at once, so I leave the red-black tree out for now.
This is left as future improvement.
:::
I also redesigned `mutex_lock` and `mutex_unlock`.
#### mutex_lock
```c
static inline void mutex_lock(mutex_t *mutex, pthread_t *caller)
{
#define MUTEX_SPINS 128
    for (int i = 0; i < MUTEX_SPINS; ++i) {
        if (mutex_trylock(mutex)) {
            spin_lock(&mutex->wait_lock);
            mutex->owner = caller;
            spin_unlock(&mutex->wait_lock);
            return;
        }
        spin_hint();
    }

    int state = exchange(&mutex->state, MUTEX_LOCKED | MUTEX_SLEEPING, relaxed);
    while (state & MUTEX_LOCKED) {
        /* Priority inheritance: if the current owner has a lower priority
         * than the caller, boost the owner before going to sleep. */
        spin_lock(&mutex->wait_lock);
        printf("The owner of the lock now is %lu\n", *mutex->owner);
        int t_c, t_o;
        struct sched_param param_c, param_o;
        pthread_getschedparam(*mutex->owner, &t_o, &param_o);
        pthread_getschedparam(*caller, &t_c, &param_c);
        if (param_o.sched_priority < param_c.sched_priority) {
            param_o.sched_priority = param_c.sched_priority;
            pthread_setschedparam(*mutex->owner, t_o, &param_o);
        }
        spin_unlock(&mutex->wait_lock);

        futex_wait(&mutex->state, MUTEX_LOCKED | MUTEX_SLEEPING);
        state = exchange(&mutex->state, MUTEX_LOCKED | MUTEX_SLEEPING, relaxed);
    }

    spin_lock(&mutex->wait_lock);
    mutex->owner = caller;
    spin_unlock(&mutex->wait_lock);
    thread_fence(&mutex->state, acquire);
}
```
Here, when a thread successfully acquires the lock it records itself as the lock owner.
If some other thread already owns the lock, the caller compares its own priority with the owner's,
and if the owner's priority is lower, the caller boosts it.
#### mutex_unlock
```c
static inline void mutex_unlock(mutex_t *mutex)
{
    int state = exchange(&mutex->state, 0, release);

    spin_lock(&mutex->wait_lock);
    mutex->owner = NULL;
    spin_unlock(&mutex->wait_lock);

    if (state & MUTEX_SLEEPING)
        futex_wake(&mutex->state, 1);
}
```
When a thread releases the mutex lock,
the owner is reset to NULL as well.
#### Experiment results
After replacing every `pthread_mutex_t` in the original program with `mutex_t`
and switching lock/unlock to the functions defined above,
the experiment produces the following result:
```bash
$ sudo gcc -O2 -o pi pthread_exp.c -lpthread
$ sudo taskset 0x1 ./pi
The owner of the lock now is 139896936539904
139896936539904 task1 priority: 46
139896919754496 task3 priority: 46
139896928147200 task2 priority: 31
```
We can see that thread 1's priority was indeed boosted,
so it is no longer preempted by thread 2, which would have caused unbounded priority inversion,
and once the lock is released thread 3 finishes first.
:::warning
For now this only solves a simple special case.
The chained priority inversion that can occur with multiple locks and multiple threads is not handled;
that would require maintaining a priority queue for every thread and every mutex,
which is left as room for future improvement.
The Linux kernel uses an rbtree as this priority queue;
it is worth thinking about whether another data structure could perform better.
:::
## skinny-mutex
TODO