# linux2023 HW2: vax-r
> [GitHub](https://github.com/vax-r/linux2023-HW2)
## Survey
The test builds a graph that is in fact a singly linked list.
Each node is handled by its own thread.
A node has a mutex lock, a condition variable, and a clock,
and the clock has a mutex lock, a condition variable, and a tick counter (`ticks`).
Note that every node in this graph shares the same clock.
The main thread spawns `N_NODES` child threads, each of which does the following:
it runs a loop whose index serves as its tick count;
if the node's clock ticks fall below the current loop index,
the thread leaves the loop and signals the node.
:::warning
TODO: think about which real-world scenarios this kind of test corresponds to.
:notes: jserv
:::
The main thread itself then performs one `clock_tick`, a `clock_wait`, and finally `clock_stop`.
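A minimal sketch of the data layout described above, using the homework's `mutex_t`/`cond_t` types; field names such as `ready` and the "negative ticks means stopped" convention are my assumptions, not necessarily the actual code:

```c
struct clock {
    mutex_t mutex;       /* protects ticks */
    cond_t cond;         /* signaled whenever ticks changes or the clock stops */
    int ticks;           /* shared tick counter (assumption: negative = stopped) */
};

struct node {
    struct clock *clock; /* the one clock shared by every node */
    struct node *parent; /* previous node in the list, NULL for node[0] */
    mutex_t mutex;       /* protects ready */
    cond_t cond;         /* signaled by node_signal */
    bool ready;          /* set once this node has been signaled */
};
```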
### futex
Reading [futex(7)](https://man7.org/linux/man-pages/man7/futex.7.html)
and [A futex overview and update](https://lwn.net/Articles/360699/)
brings out several key points about futexes:
* A futex is a piece of memory shared between processes; it need not be mapped at the same address in every process.
* A futex is an aligned integer which is touched only by atomic assembler instructions.
* Futex operation occurs entirely in user space for the noncontended case.
The futex system call looks like this:
```clike
long syscall(SYS_futex, uint32_t *uaddr, int futex_op, uint32_t val,
             const struct timespec *timeout, /* or: uint32_t val2 */
             uint32_t *uaddr2, uint32_t val3);
```
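For reference, a minimal sketch of thin wrappers around the raw system call; the `futex_wait`/`futex_wake` helpers used later in this note presumably wrap the same two operations, though their exact signatures may differ (the `_PRIVATE` variants would also work for purely intra-process futexes):

```c
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Sleep as long as *uaddr still equals val; the kernel re-checks this
 * atomically before putting the caller on the wait queue. */
static long futex_wait(uint32_t *uaddr, uint32_t val)
{
    return syscall(SYS_futex, uaddr, FUTEX_WAIT, val, NULL, NULL, 0);
}

/* Wake at most n waiters currently sleeping on uaddr. */
static long futex_wake(uint32_t *uaddr, uint32_t n)
{
    return syscall(SYS_futex, uaddr, FUTEX_WAKE, n, NULL, NULL, 0);
}
```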
Waking up a futex splits into two cases:
**uncontended**
Use assembler instructions that cause the host CPU to atomically increment the integer, then check whether it has in fact changed from 0 to 1, in which case there were no waiters and the operation is done.
**contended**
The atomic increment changed the counter from -1 (or some other negative number). If this is detected, there are waiters. User space should now set the counter to 1 and instruct the kernel to wake up any waiters using the FUTEX_WAKE operation.
:::info
**FUTEX_WAKE**
This operation wakes at most `val` of the waiters waiting on `uaddr`.
`val` is usually set to 1 (wake a single waiter) or `INT_MAX` (wake all waiters).
There is no guarantee about which waiter gets woken.
:::
Waiting on a futex also splits into two cases:
**uncontended**
Atomically decrement the counter and check whether it changed to 0, in which case the operation is done.
**contended**
The process should set the counter to -1 and request that the kernel wait for another process to up the futex. This is done using the FUTEX_WAIT operation.
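A sketch of how these pieces combine into a lock. Instead of the increment/decrement counter protocol quoted above, this uses the 0/1/2 encoding from Ulrich Drepper's "Futexes Are Tricky" (0 = unlocked, 1 = locked, 2 = locked and possibly contended), but it illustrates the same idea: the uncontended path never enters the kernel.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Reuses the futex_wait()/futex_wake() wrappers sketched earlier. */

static void futex_mutex_lock(atomic_int *word)
{
    int expected = 0;
    /* Fast path: 0 -> 1 with a single CAS, no system call. */
    if (atomic_compare_exchange_strong(word, &expected, 1))
        return;

    /* Slow path: mark the lock contended (2). Whoever observes the old
     * value 0 owns the lock; everyone else sleeps in the kernel. */
    while (atomic_exchange(word, 2) != 0)
        futex_wait((uint32_t *) word, 2); /* sleep only while *word == 2 */
}

static void futex_mutex_unlock(atomic_int *word)
{
    /* If the old value was 1, no waiter ever announced itself. */
    if (atomic_exchange(word, 0) == 2)
        futex_wake((uint32_t *) word, 1); /* wake exactly one waiter */
}
```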
### futex requeue PI
PI stands for priority inheritance.
A traditional `pthread_cond_broadcast` wakes every thread sleeping on the condition variable,
and all of them then compete for the lock at once.
With futexes the goal is to decide up front which waiter thread has the highest priority
and wake only that one.
### Why is futex fast?
A futex improves performance in the following ways.
A futex is essentially an atomic variable maintained in user space plus a wait queue kept in kernel space.
When there is no contention, lock and unlock are carried out entirely in user space,
which saves the cost and time of switching into kernel space and therefore improves performance.
Only under contention does a system call enter kernel space.
And instead of letting all threads compete for the lock the way a plain mutex does,
a futex lets the kernel use the scheduler or priorities to choose which thread to wake.
## Basic problems
`AAAA`:`fetch_or(&mutex->state, MUTEX_SLEEPING, relaxed);`
`BBBB`:`fetch_add(&cond->seq, 1, relaxed);`
`CCCC`:`fetch_add(&cond->seq, 1, relaxed);`
`DDDD`:`futex_requeue(&cond->seq, 1, &mutex->state);`
`EEEE`:`futex_wake(&cond->seq, 1);`
`FFFF`:`futex_wake(&mutex->state, 1);`
## Explain the code
### Working flow
From the code in `main` we can infer that, after the clock and all nodes are initialized,
threads are created to handle the nodes.
However, each thread first hits the `clock_wait` in the loop condition of `thread_func`.
At that point `clock->ticks` is 0, so every thread goes to sleep on `clock->cond`
until the main thread performs the first `clock_tick`,
which bumps `clock->ticks` to 1 and wakes one waiter.
After that the main thread sits in `clock_wait` until all nodes have been processed.
The tick each node passes to `clock_wait` is its own loop index,
and `clock->ticks` grows so fast that the loop index quickly falls behind,
so, except at the very beginning, the condition of the while loop inside `clock_wait` is almost never true.
Also, because node[0] has no parent, its state is always ready,
while every other node has to wait for its parent inside `node_wait`.
We can therefore infer that node[0] is the node that performs `clock_tick` the most times.
These steps keep repeating until `clock->ticks` reaches `1 << N_NODES`.
To verify that my reasoning is correct,
I ran the following experiment to count how many ticks each node actually contributes to the clock,
and got an interesting result:
```
node 0 ticks for 32768 times
node 1 ticks for 16384 times
node 2 ticks for 8192 times
node 3 ticks for 4096 times
node 4 ticks for 2048 times
node 5 ticks for 1024 times
node 6 ticks for 512 times
node 7 ticks for 256 times
node 8 ticks for 128 times
node 9 ticks for 64 times
node 10 ticks for 32 times
node 11 ticks for 16 times
node 12 ticks for 8 times
node 13 ticks for 4 times
node 14 ticks for 2 times
node 15 ticks for 1 times
```
The result above can actually be deduced from `thread_func`.
Think of each node's advancing loop index as a private clock that the node keeps for itself,
called the private clock below,
and call the shared clock the public clock.
### thread_func
If node[i] is not ready, then no node[j] with j > i can make progress,
so the larger a node's index, the harder it is for it to run, because it must wait for all of its ancestors to become ready.
Each time the private clock advances by 1, `bit` flips once,
and `bit` decides whether the node performs `node_signal` or `clock_tick`,
so the faster a node flips its bit, the more times it can execute `clock_tick`;
in other words, the faster a node's private clock runs, the more ticks it contributes.
The pace of the private clock is limited by `clock_wait`, i.e. the comparison between the public clock and the private clock.
Since node[0] has no ancestors and a private clock rarely gets ahead of the public clock,
assume node[0] keeps flipping its bit, alternating between `node_signal` and `clock_tick`.
Only half of node[0]'s iterations perform the `node_signal` that wakes node[1],
so node[1] flips its bit at half the rate of node[0],
and therefore node[1] executes `clock_tick` only half as many times as node[0].
By the same argument, node[i] executes `clock_tick` twice as many times as node[i+1].
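For reference, here is a sketch of the unmodified `thread_func` that the analysis above assumes; it is reconstructed from the modified version shown below, so details may differ slightly from the actual homework code:

```c
static void *thread_func(void *ptr)
{
    struct node *self = ptr;
    bool bit = false;

    /* The loop index i is this node's "private clock". */
    for (int i = 1; clock_wait(self->clock, i); ++i) {
        if (self->parent)
            node_wait(self->parent); /* wait until the parent has run */

        if (bit)
            node_signal(self);       /* every other iteration: wake the child */
        else
            clock_tick(self->clock); /* otherwise: advance the shared clock */
        bit = !bit;
    }

    node_signal(self); /* let the child observe the end as well */
    return NULL;
}
```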
If you don't believe it, we can replace `bit` with an integer
and drive it so that node[i] contributes four times as many public clock ticks as node[i+1]:
```clike
static void *thread_func(void *ptr)
{
    struct node *self = ptr;
    // bool bit = false;
    int bit = 1;

    for (int i = 1; clock_wait(self->clock, i); ++i) {
        if (self->parent)
            node_wait(self->parent);

        /* Signal the child only every fourth iteration ... */
        if (bit % 4 == 0) {
            node_signal(self);
        } else {
            /* ... otherwise advance the shared clock and count it. */
            clock_tick(self->clock);
            mutex_lock(&self->mutex);
            self->ticktime++;
            mutex_unlock(&self->mutex);
        }
        bit++;
    }

    node_signal(self);
    return NULL;
}
```
The result is as follows; each node indeed contributes four times as many ticks as the next one:
```
node 0 ticks for 49152 times
node 1 ticks for 12288 times
node 2 ticks for 3072 times
node 3 ticks for 768 times
node 4 ticks for 192 times
node 5 ticks for 48 times
node 6 ticks for 12 times
node 7 ticks for 3 times
node 8 ticks for 0 times
node 9 ticks for 0 times
node 10 ticks for 0 times
node 11 ticks for 0 times
node 12 ticks for 0 times
node 13 ticks for 0 times
node 14 ticks for 0 times
node 15 ticks for 0 times
```
### node_wait
Mainly used to wait until the parent node has finished its work, as sketched below.
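The note does not show `node_wait` itself, but from the description it presumably looks roughly like this sketch (the `ready` field name is an assumption):

```c
static void node_wait(struct node *node)
{
    mutex_lock(&node->mutex);
    while (!node->ready)                      /* sleep until node_signal() marks us ready */
        cond_wait(&node->cond, &node->mutex);
    node->ready = false;                      /* consume the signal */
    mutex_unlock(&node->mutex);
}
```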
### clock_wait
Used to wait until `clock->ticks` is greater than or equal to the caller's loop index,
or until the clock has been stopped.
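Likewise, a sketch of what `clock_wait` presumably does, assuming `clock_stop` marks the clock with a negative tick count (the actual stop convention in the homework may differ):

```c
static bool clock_wait(struct clock *clock, int ticks)
{
    mutex_lock(&clock->mutex);
    /* Sleep while the clock is still running but has not reached `ticks` yet. */
    while (clock->ticks >= 0 && clock->ticks < ticks)
        cond_wait(&clock->cond, &clock->mutex);
    bool keep_running = clock->ticks >= ticks; /* false once the clock stopped */
    mutex_unlock(&clock->mutex);
    return keep_running;
}
```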
### mutex
#### mutex_trylock
```clike
static bool mutex_trylock(mutex_t *mutex)
{
    int state = load(&mutex->state, relaxed);
    if (state & MUTEX_LOCKED)
        return false;

    state = fetch_or(&mutex->state, MUTEX_LOCKED, relaxed);
    if (state & MUTEX_LOCKED)
        return false;

    thread_fence(&mutex->state, acquire);
    return true;
}
```
Check whether `mutex` is already locked; if so, return false.
Otherwise issue a `thread_fence` and return true.
The initial `load` may not be necessary, because `fetch_or` already returns the old value of the mutex while attempting to take the lock; a simplified variant is sketched below.
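A sketch of that simplification (untested; the early relaxed load may still be worth keeping as a cheap way to skip the read-modify-write when the lock is obviously held):

```c
static bool mutex_trylock(mutex_t *mutex)
{
    /* fetch_or both attempts the lock and hands back the previous state. */
    int state = fetch_or(&mutex->state, MUTEX_LOCKED, relaxed);
    if (state & MUTEX_LOCKED)
        return false;
    thread_fence(&mutex->state, acquire);
    return true;
}
```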
:::warning
I do not understand `thread_fence` yet; I should study the relevant material.
:::
#### mutex_lock
```clike
static inline void mutex_lock(mutex_t *mutex)
{
#define MUTEX_SPINS 128
    for (int i = 0; i < MUTEX_SPINS; ++i) {
        if (mutex_trylock(mutex))
            return;
        spin_hint();
    }

    int state = exchange(&mutex->state, MUTEX_LOCKED | MUTEX_SLEEPING, relaxed);
    while (state & MUTEX_LOCKED) {
        futex_wait(&mutex->state, MUTEX_LOCKED | MUTEX_SLEEPING);
        state = exchange(&mutex->state, MUTEX_LOCKED | MUTEX_SLEEPING, relaxed);
    }
    thread_fence(&mutex->state, acquire);
}
```
If `mutex_trylock` succeeds, return immediately.
If the lock still has not been acquired after spinning, fall out of the loop and perform an atomic exchange.
If `state & MUTEX_LOCKED` is set in the returned old value, someone else still holds the lock,
so the thread calls `futex_wait` and repeats until the lock is released; the exchange whose old value has `MUTEX_LOCKED` cleared is the one that acquires the lock.
:::warning
Why is `MUTEX_LOCKED | MUTEX_SLEEPING` used as the value compared against the futex word?
Wouldn't plain 1 do?
The difference deserves further investigation.
:::
#### mutex_unlock
```clike
static inline void mutex_unlock(mutex_t *mutex)
{
    int state = exchange(&mutex->state, 0, release);
    if (state & MUTEX_SLEEPING)
        futex_wake(&mutex->state, 1);
}
```
After grabbing the old lock state, check whether any thread is sleeping on the lock;
if so, have `futex_wake` wake up exactly one thread.
Unlike a traditional mutex_unlock, which may wake all threads and let them all compete for the lock again,
here the old state tells user space whether there was contention,
and only then does user space ask the kernel to pick a single waiter to wake.
:::warning
How is that waiter chosen? PI does not seem to be used here.
:::
### condition variable
#### cond_wait
```clike
static inline void cond_wait(cond_t *cond, mutex_t *mutex)
{
    int seq = load(&cond->seq, relaxed);
    mutex_unlock(mutex);

#define COND_SPINS 128
    for (int i = 0; i < COND_SPINS; ++i) {
        if (load(&cond->seq, relaxed) != seq) {
            mutex_lock(mutex);
            return;
        }
        spin_hint();
    }

    futex_wait(&cond->seq, seq);
    mutex_lock(mutex);
    fetch_or(&mutex->state, MUTEX_SLEEPING, relaxed);
}
```
First read the current sequence number of the condition variable, then release the lock.
If `cond->seq` later differs from the value just read, some thread has probably performed a signal or broadcast,
and if this thread happens to notice that during the spin phase, it simply re-acquires the lock and returns.
To avoid burning too many atomic instructions, after enough spins it gives up and enters `futex_wait`.
Being woken up means it is this thread's turn, so it re-acquires the lock.
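A small usage sketch of this mutex/condition-variable pair (not from the homework; `done` and the function names are illustrative), showing the usual pattern of re-checking the predicate in a loop after every wakeup:

```c
static mutex_t m;
static cond_t cv;
static bool done = false;

static void waiter(void)
{
    mutex_lock(&m);
    while (!done)             /* spurious or stolen wakeups are possible */
        cond_wait(&cv, &m);
    mutex_unlock(&m);
}

static void finisher(void)
{
    mutex_lock(&m);
    done = true;
    mutex_unlock(&m);
    cond_signal(&cv, &m);     /* wake at most one waiter on cv */
}
```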
#### cond_signal
```clike
static inline void cond_signal(cond_t *cond, mutex_t *mutex)
{
    fetch_add(&cond->seq, 1, relaxed);
    futex_wake(&cond->seq, 1);
}
```
Bump `seq` and use the futex to wake up one waiter.
#### cond_broadcast
```clike
static inline void cond_broadcast(cond_t *cond, mutex_t *mutex)
{
    fetch_add(&cond->seq, 1, relaxed);
    futex_requeue(&cond->seq, 1, &mutex->state);
}
```
Use `futex_requeue` to pick one waiter out of all of them and wake it up,
then move the remaining waiters onto the `mutex` so they keep sleeping there instead of stampeding for the lock.
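The `futex_requeue` helper is presumably a thin wrapper over the kernel's `FUTEX_REQUEUE` operation, roughly like the sketch below: wake up to `limit` waiters on the first futex word and requeue the rest onto the second one (here, from `cond->seq` onto `mutex->state`). The actual wrapper's signature may differ.

```c
#include <limits.h>

static long futex_requeue(uint32_t *uaddr, uint32_t limit, uint32_t *uaddr2)
{
    /* Wake at most `limit` waiters on uaddr, move the rest to uaddr2. */
    return syscall(SYS_futex, uaddr, FUTEX_REQUEUE, limit,
                   (long) INT_MAX, uaddr2, 0);
}
```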
### Strengthening the futex mechanism for pthreads
[PI-futex](https://www.kernel.org/doc/Documentation/pi-futex.txt)
When contention happens in user space,
we cannot make a task non-preemptible inside its critical section by disabling interrupts or by using a spinlock,
so priority inheritance is used to strengthen the mechanism instead.
On the fast path, a PI-enabled pthread mutex needs no kernel-space involvement at all.
> 'robustness' and 'PI' are two orthogonal properties of futexes, and all four combinations are possible: futex, robust-futex, PI-futex, robust+PI-futex.
## qsort with futex
Apply this lightweight pthread implementation to the qsort from homework 1.
First I had to set up an experiment environment: my machine runs macOS, which has no futex and lacks some Linux commands,
so I added a Docker container and ran the program inside it.
Since the code is somewhat long, the corresponding commit is linked below:
[commit](https://github.com/vax-r/linux2023-HW2/commit/a8b1922c82c7bfc2ac6166753c6ad795a1c48b74)
## Implementing priority inheritance
### priority inversion
See [Introduction to RTOS - Solution to Part 11 (Priority Inversion)](https://www.digikey.com/en/maker/projects/introduction-to-rtos-solution-to-part-11-priority-inversion/abf4b8f7cd4a4c70bece35678d178321).
Priority inversion occurs when a task with higher priority is preempted by a task with lower priority.
> Priority inversion is a bug that occurs when a high priority task is indirectly preempted by a low priority task.
Note that the preemption here is **indirect**.
After all, no scheduling algorithm would let a low-priority task directly preempt a high-priority one; otherwise what would be the point of assigning priorities?
Usually the high-priority task is forced to give up the CPU for reasons beyond its control.
For example, the high-priority task requests a mutex lock that happens to be held by a low-priority task;
at that point the high-priority task has no choice but to hand over the CPU.
Priority inversion comes in two flavors, bounded and unbounded.
* **bounded priority inversion**

  In the figure there are two tasks, task H and task L.
  As the names suggest, task H has the higher priority.
  Because the lock task H wants is held by task L, task H has to yield the CPU first,
  but task L is guaranteed to release the lock after some finite amount of time, at which point task H can continue,
  so the waiting time is bounded.
* **unbounded priority inversion**

  Now suppose there is a third task, task M, whose priority is higher than task L's but lower than task H's.
  When task H yields the CPU to task L because of the lock,
  task M, having higher priority than task L, can preempt task L and take the CPU.
  In that situation there is no telling when task M will hand the CPU back to task L,
  and therefore no telling when task L will finally release the lock to task H,
  so the waiting time is unbounded.
:::info
Priority inversion once caused a malfunction in a machine sent to Mars:
[What really happened on Mars Rover Pathfinder](http://www.cs.cornell.edu/courses/cs614/1999sp/papers/pathfinder.html)
:::
There are two main ways to solve unbounded priority inversion:
the priority ceiling protocol and priority inheritance.
### priority inheritance
Raise task L's priority so that task L can no longer be preempted by task M,
thereby avoiding unbounded priority inversion.

As the figure shows, when priority inversion occurs and the CPU returns to task L,
task L's priority is raised to match task H's.
That way task M has no way to preempt task L, so unbounded priority inversion cannot happen.
:::warning
Neither priority inheritance nor the priority ceiling protocol
solves anything more than unbounded priority inversion.
Bounded priority inversion is hard to avoid; the more realistic mitigations are:
* Keep the work inside critical sections as short as possible, so that any bounded priority inversion is short as well.
* Avoid locks that can block a high-priority task.
* Use one task to control a shared resource, to avoid the need to create locks to protect it (I am not sure what kind of task this refers to).
:::
### Understand Linux Kernel PI-mutex
First, understand how the Linux kernel implements a mutex capable of priority inheritance:
[RT-mutex subsystem with PI support](https://docs.kernel.org/locking/rt-mutex.html)
The Linux kernel builds PI-futexes on top of an RT-mutex with priority inheritance,
which is also what gives `pthread_mutex_t` a priority-inheritance attribute.
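From user space that attribute is requested through the standard pthreads API; a minimal example (this is the stock glibc/POSIX interface, independent of the homework code):

```c
#include <pthread.h>

static pthread_mutex_t pi_mutex;

static void init_pi_mutex(void)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    /* Ask for a priority-inheritance mutex, backed by the kernel's PI-futex. */
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    pthread_mutex_init(&pi_mutex, &attr);
    pthread_mutexattr_destroy(&attr);
}
```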
The document notes:
> If the temporarily boosted owner blocks on a rt-mutex itself it propagates the priority boosting to the owner of the other rt_mutex it gets blocked on.
In other words, a high-priority waiter **propagates** its priority boost to the owner.
Moreover, an rt-mutex has a waiter tree, and enqueueing into that tree must follow priority order;
waiters of equal priority are handled FIFO.
Only the top-priority waiter is enqueued into the owner's priority waiter tree.
Whenever the top-priority waiter changes, the owner's priority is changed accordingly.
The priority enqueueing is handled through `pi_waiters`.
Next, look at how the RT-mutex is implemented:
[RT-mutex implementation design](https://docs.kernel.org/locking/rt-mutex-design.html)
Start with a few terms:
* **PI chain**
  An ordered series of locks and processes that cause processes to inherit priorities from a previous process that is blocked on one of its locks.
```
E->L4->D->L3->C-+
                +->L2-+
                |     |
              G-+     +->B->L1->A
                      |
                F->L5-+
```
  A PI chain looks like the diagram above; `E->L4` means E is trying to acquire L4, and so on down the chain.
  Note that a PI chain never diverges, it only converges.
  For PI to work, the processes toward the right of the chain must end up with a priority greater than or equal to the processes on their left.
  For example, if G has the highest priority, then the priorities of both A and B have to be boosted to match G's.
* **mutex**
  In the material below, to distinguish the locks that implement PI from the spin locks used inside the PI code, the PI locks are called mutexes.
* **lock**
  The word lock refers to the spin locks that protect parts of the PI algorithm.
* **waiter**
  A structure stored on the stack of a process blocking on the mutex; since the process stays blocked for the waiter's whole lifetime, it is safe to keep it on the process's stack.
  The structure holds a pointer to the task and to the mutex that the task is blocked on.
  It also embeds **rbtree node structures**,
  so it can sit both in the mutex's waiter rbtree and in the `pi_waiters` rbtree of the mutex owner task.
  *task and process are simply two names used to tell the two processes involved apart.*
* **mutex waiters tree**
  The mutex stores the waiters blocked on it in an rbtree,
  protected by a spin lock called `wait_lock`.
* **task PI tree**
  Every process has its own PI rbtree.
  This is a tree of all the top waiters of the mutexes that are owned by the process.
  If the process is to inherit a new priority, it is always the priority of the task at the top of this PI tree.
  The tree is called `pi_waiters` and is protected by a spin lock called `pi_lock`;
  note that interrupts must be disabled while `pi_lock` is held.
* **mutex owner and flags**
  The mutex struct keeps a pointer to its owner.
  Because the task struct is at least 2-byte aligned, the "has waiters" bit can be stuffed into the least significant bit of that pointer, as sketched after this list.
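A user-space sketch of that trick (illustrative only, not the kernel's actual code): because the pointer is at least 2-byte aligned, bit 0 is always free to carry the "has waiters" flag.

```c
#include <stdbool.h>
#include <stdint.h>

struct task_struct;                 /* stand-in for the kernel's task struct */

#define HAS_WAITERS 1UL

/* Pack the flag into bit 0 of the (aligned) owner pointer. */
static inline uintptr_t owner_encode(struct task_struct *owner, bool has_waiters)
{
    return (uintptr_t) owner | (has_waiters ? HAS_WAITERS : 0);
}

/* Strip the flag to recover the real pointer. */
static inline struct task_struct *owner_decode(uintptr_t word)
{
    return (struct task_struct *) (word & ~HAS_WAITERS);
}

static inline bool owner_has_waiters(uintptr_t word)
{
    return word & HAS_WAITERS;
}
```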
See also: [Futex Requeue PI](https://docs.kernel.org/locking/futex-requeue-pi.html)
### Reproducing priority inversion
Before implementing the PI-mutex, first construct a priority-inversion scenario and then try to solve it.
Note that this experiment must be run with root privileges,
and the `taskset` command is used to pin all the threads onto the same CPU.
For pthreads, a lower priority number means a lower priority by default.
```c
#include <unistd.h>
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

static pthread_mutex_t lock1, lock2;

void *task1(void *arg)
{
    pthread_mutex_lock(&lock1);
    sleep(1);
    int t;
    struct sched_param param;
    pthread_t id = *(pthread_t *) arg;
    pthread_getschedparam(id, &t, &param);
    printf("task1 priority: %d\n", param.sched_priority);
    pthread_mutex_unlock(&lock1);
    return NULL;
}

void *task2(void *arg)
{
    sleep(1);
    int t;
    struct sched_param param;
    pthread_t id = *(pthread_t *) arg;
    pthread_getschedparam(id, &t, &param);
    printf("task2 priority: %d\n", param.sched_priority);
    return NULL;
}

void *task3(void *arg)
{
    pthread_mutex_lock(&lock1);
    // sleep(1);
    int t;
    struct sched_param param;
    pthread_t id = *(pthread_t *) arg;
    pthread_getschedparam(id, &t, &param);
    printf("task3 priority: %d\n", param.sched_priority);
    pthread_mutex_unlock(&lock1);
    return NULL;
}

#define THREAD_COUNT 3

int main()
{
    pthread_t ids[THREAD_COUNT];
    pthread_attr_t attr[THREAD_COUNT];
    struct sched_param param[THREAD_COUNT];
    void *result;
    int t1 = 0, status;

    printf("priority(min-max): %d-%d\n", sched_get_priority_min(t1),
           sched_get_priority_max(t1));

    for (int i = 0; i < THREAD_COUNT; i++) {
        if (pthread_attr_init(&attr[i])) {
            printf("Error when initializing attr %d\n", i);
            exit(1);
        }
    }

    /* Explicitly use SCHED_RR with increasing priorities: task1 < task2 < task3 */
    for (int i = 0; i < THREAD_COUNT; i++) {
        pthread_attr_getschedparam(&attr[i], &param[i]);
        pthread_attr_setschedpolicy(&attr[i], SCHED_RR);
        pthread_attr_setinheritsched(&attr[i], PTHREAD_EXPLICIT_SCHED);
        param[i].sched_priority = (i + 1) * 15 + 1;
        pthread_attr_setschedparam(&attr[i], &param[i]);
    }

    pthread_mutex_init(&lock1, NULL);
    pthread_mutex_init(&lock2, NULL);

    pthread_create(&ids[0], &attr[0], task1, (void *) &ids[0]);
    pthread_create(&ids[1], &attr[1], task2, (void *) &ids[1]);
    pthread_create(&ids[2], &attr[2], task3, (void *) &ids[2]);

    for (int i = 0; i < THREAD_COUNT; i++)
        pthread_join(ids[i], &result);
    for (int i = 0; i < THREAD_COUNT; i++)
        pthread_attr_destroy(&attr[i]);
    return 0;
}
```
```bash
$ sudo gcc -O2 -o pi pthread_exp.c -lpthread
$ sudo taskset 0x1 ./pi
task2 priority: 31
task1 priority: 16
task3 priority: 46
```
### Designing and implementing a PI-mutex
:::info
The code lives here: [Github PI-implement](https://github.com/vax-r/linux2023-HW2/tree/master/PI-implement)
:::
Following the Linux kernel's PI-mutex implementation,
I added two new members to the original `mutex_t` structure:
`owner` is the thread that currently owns the mutex lock,
and `wait_lock` protects `owner`.
```c
typedef struct {
    atomic int state;
    pthread_t *owner;
    spinlock_t wait_lock;
} mutex_t;
```
:::warning
The Linux kernel additionally maintains a red-black tree inside the mutex,
but in my case at most two threads ever contend at once, so I leave the red-black tree out for now.
This is left as future improvement.
:::
I also redesigned `mutex_lock` and `mutex_unlock`.
#### mutex_lock
```c
static inline void mutex_lock(mutex_t *mutex, pthread_t *caller)
{
#define MUTEX_SPINS 128
    for (int i = 0; i < MUTEX_SPINS; ++i) {
        if (mutex_trylock(mutex)) {
            spin_lock(&mutex->wait_lock);
            mutex->owner = caller;
            spin_unlock(&mutex->wait_lock);
            return;
        }
        spin_hint();
    }

    int state = exchange(&mutex->state, MUTEX_LOCKED | MUTEX_SLEEPING, relaxed);
    while (state & MUTEX_LOCKED) {
        /* Priority inheritance: if the current owner has a lower priority
         * than the caller, boost the owner before going to sleep. */
        spin_lock(&mutex->wait_lock);
        printf("The owner of the lock now is %lu\n", *mutex->owner);
        int t_c, t_o;
        struct sched_param param_c, param_o;
        pthread_getschedparam(*mutex->owner, &t_o, &param_o);
        pthread_getschedparam(*caller, &t_c, &param_c);
        if (param_o.sched_priority < param_c.sched_priority) {
            param_o.sched_priority = param_c.sched_priority;
            pthread_setschedparam(*mutex->owner, t_o, &param_o);
        }
        spin_unlock(&mutex->wait_lock);

        futex_wait(&mutex->state, MUTEX_LOCKED | MUTEX_SLEEPING);
        state = exchange(&mutex->state, MUTEX_LOCKED | MUTEX_SLEEPING, relaxed);
    }

    spin_lock(&mutex->wait_lock);
    mutex->owner = caller;
    spin_unlock(&mutex->wait_lock);
    thread_fence(&mutex->state, acquire);
}
```
Here, when a thread successfully acquires the lock it records itself as the lock owner.
If some other thread already owns the lock, the caller compares its own priority with the owner's,
and if the owner's priority is lower, the caller boosts it.
#### mutex_unlock
```c
static inline void mutex_unlock(mutex_t *mutex)
{
    int state = exchange(&mutex->state, 0, release);

    spin_lock(&mutex->wait_lock);
    mutex->owner = NULL;
    spin_unlock(&mutex->wait_lock);

    if (state & MUTEX_SLEEPING)
        futex_wake(&mutex->state, 1);
}
```
When a thread releases the mutex lock,
the owner is reset to NULL as well.
#### Experiment results
After replacing every `pthread_mutex_t` in the original program with `mutex_t`
and switching lock/unlock to the functions defined above,
the experiment produces the following result:
```bash
$ sudo gcc -O2 -o pi pthread_exp.c -lpthread
$ sudo taskset 0x1 ./pi
The owner of the lock now is 139896936539904
139896936539904 task1 priority: 46
139896919754496 task3 priority: 46
139896928147200 task2 priority: 31
```
We can see that thread 1's priority was indeed boosted,
so it is no longer preempted by thread 2, which would have caused unbounded priority inversion,
and once the lock is released thread 3 finishes first.
:::warning
For now this only solves a simple special case.
The chained priority inversion that can occur with multiple locks and multiple threads is not handled;
that would require maintaining a priority queue for every thread and every mutex,
which is left as room for future improvement.
The Linux kernel uses an rbtree as this priority queue;
it is worth thinking about whether another data structure could perform better.
:::
## skinny-mutex
TODO