# 2023 Homework2 (quiz1)
contributed by < `Cuda-Chen` >
[2023 summer quiz1 題目](https://hackmd.io/@sysprog/linux2023-summer-quiz1)
[source code hosted on GitHub](https://github.com/Cuda-Chen/linux-2023-summer/tree/main/hw2)
### futex 介紹
在 concurrency programming 中,我們可以使用 mutux + cond var 來防止 race condition。 然此種作法會需要兩個 wait 加上兩個 wake ,並且以上四次動作需要 syscall,造成效能浪費。
為了增進效能,Linux 提供 futex(fast userspace mutex),讓相關的操作可以在 userspace 進行,減少 syscall 的次數;亦提供 requeue 功能,將二個佇列之間的執行緒縮減為一次呼叫,減少競爭 mutex lock 的次數,進而提昇效能。
## 測驗 `1`
### 測驗 `1` - 1
#### 測試方法
(待補)
#### 運作原理
這個程式是計算於 coputation graph 中每個 node 之執行時間。
需要修改之地方分佈在 `cond.h` 以及 `mutex.h` 中,所以我們接下來看這兩個檔案。
看看 `AAAA` ~ `EEEE` 所在程式碼部份:
```c
// cond.h
/* 用以等待 cond var 是否已到特定條件(`cond->seq`) */
static inline void cond_wait(cond_t *cond, mutex_t *mutex)
{
int seq = load(&cond->seq, relaxed);
mutex_unlock(mutex);
#define COND_SPINS 128
for (int i = 0; i < COND_SPINS; ++i) {
if (load(&cond->seq, relaxed) != seq) {
mutex_lock(mutex);
return;
}
spin_hint();
}
futex_wait(&cond->seq, seq);
mutex_lock(mutex);
fetch_or(&mutex->state, MUTEX_SLEEPING, relaxed); // AAAA
}
/* 發出 signal 以通知其他 thread */
static inline void cond_signal(cond_t *cond, mutex_t *mutex)
{
fetch_add(&cond->seq, 1, relaxed); // BBBB
futex_wake(&cond->seq, 1); // EEEE
}
/* 發出 broadcast 給其他 thread */
static inline void cond_broadcast(cond_t *cond, mutex_t *mutex)
{
fetch_add(&cond->seq, 1 relaxed); // CCCC
futex_requeue(&cond->seq, 1, &mutex->state); // DDDD
}
```
```c
// mutex.h
/* 將 mutex unlock,並且如果 mutex 是 MUTEX_SLEEPING,發出 wake signal */
static inline void mutex_unlock(mutex_t *mutex)
{
int state = exchange(&mutex->state, 0, release);
if (state & MUTEX_SLEEPING)
futex_wake(&mutex->state, 1); // FFFF
}
```
#### futex 使用方式
根據 [futex(2)](https://man7.org/linux/man-pages/man2/futex.2.html) 提到的用法如下:
```c
#include <linux/futex.h> /* Definition of FUTEX_* constants */
#include <sys/syscall.h> /* Definition of SYS_* constants */
#include <unistd.h>
long syscall(
SYS_futex,
uint32_t *uaddr,
int futex_op,
uint32_t val,
const struct timespec *timeout, /* or: uint32_t val2 */
uint32_t *uaddr2,
uint32_t val3
);
```
由於 `futex()` 沒有提供 wrapper,所以使用的時候需要使用 `syscall()`。
我們可以參考作業中的以下兩個函式來達成 mutex 想要的 wait 以及 wake:
```c
/* Atomically check if '*futex == value', and if so, go to sleep */
static inline void futex_wait(atomic int *futex, int value)
{
syscall(SYS_futex, futex, FUTEX_WAIT_PRIVATE, value, NULL);
}
/* Wake up 'limit' threads currently waiting on 'futex' */
static inline void futex_wake(atomic int *futex, int limit)
{
syscall(SYS_futex, futex, FUTEX_WAKE_PRIVATE, limit);
}
```
#### Linux 核心開發者針對 POSIX Threads 強化哪些相關 futex 實作機制
futex 目前在有以下問題(摘自 [The futex_waitv() syscall and gaming on Linux](https://www.collabora.com/news-and-blog/blog/2023/02/17/the-futex-waitv-syscall-gaming-on-linux/)):
> 1. The futex_wait() can only be done one one futex at a time.
> 2. The futex syscall interface is complex because of the multiplexing of plenty operations.
> 3. Only the futex word of size 32-bit is supported.
> 4. The implementation of the futex isn't NUMA aware.
為此,[[PATCH v5 00/11] Add futex2 syscalls](https://lore.kernel.org/lkml/20210709001328.329716-1-andrealmeid@collabora.com/) 中提到 Collabora 為了解決以上問題,提出兩個作法:
1. 新增 `futex_waitv()` syscall
2. 讓 futex 大小可以任意調節
##### 新增 `futex_waitv()` syscall
> The futex_wait() can only be done one one futex at a time.
為了解決這個問題,我們可以使用 `eventfd()` syscall 來等待多個 mutex,然而當 mutex 數量到一定程度時,`eventfd()` 會變成 bottleneck。
因此,Collabora 引進 `futex_waitv()` syscall。`futex_waitv()` 可以對一個 futex array 進行等待。
##### 讓 futex 大小可以任意調節
futex 的大小固定是 32-bit,然而此固定大小除了不能容納更多的 debug messages 外,也會使 futex 於 NUMA 的表現產生影響。所以 Collabora 讓 futex 的大小可以根據工程師的需求進行大小調整,其應用情境如下:
1. futex 大小調整為 64-bit,用以容納更多 debug messages
2. futex 大小調整為 8-bit,在 NUMA 時可以減少資料傳輸,提昇效能
### 測驗 `1` - 2
[b2fd20a](https://github.com/Cuda-Chen/linux-2023-summer/commit/b2fd20a4c1766566729f05b74cb1e7492758beff)
主要將原本的 `pthread_mutex_xxx` 以及 `pthread_cond_xxx` 分別改成 `mutex_xxx` 以及 `cond_xxx`(有例外是 `pthread_mutex_destroy` 以及 `pthread_cond_destroy`,遇到這兩個直接刪掉即可)
執行時間如下表(可以看出跟 [作業一](https://hackmd.io/@gb_c16rKTiu-nweuRXXVNg/linux2023-summer-quiz?view#%E6%B8%AC%E9%A9%97-%CE%B3-%E2%88%92-3) 相比,當 N(input size)越來越大,執行時間有縮短):
| N | time (measure in seconds) |
| -------- | -------- |
| 10^2 | 0.00374 |
| 10^3 | 0.00629 |
| 10^4 | 0.00891 |
| 10^6 | 0.218 |
| 10^7 | 2 |
| 10^9 | 279 |
### 測驗 `1` - 3
可能可以從 futex2 開始。
執行時要用 `taskset` 限制在一個 CPU 上執行,且指定 priority 要記得使用 `sudo`
[a9b00e7](https://github.com/Cuda-Chen/linux-2023-summer/commit/a9b00e77448749221a3fcf70c19b598f5529a5a8)
首先我們可以從 [實作 priority inversion](https://hackmd.io/@sysprog/concurrency-thread-package#%E5%AF%A6%E4%BD%9C-priority-inversion) 中建立三個擁有 priority 的 process 以進行實驗。
分配 priority 以及 task 內容如下:
```c
void *task1(void *arg)
{
mutex_lock(&mtx1);
printf("1\n");
mutex_unlock(&mtx1);
return NULL;
}
void *task2(void *arg)
{
mutex_lock(&mtx2);
sleep(1);
printf("2\n");
mutex_unlock(&mtx2);
return NULL;
}
void *task3(void *arg)
{
mutex_lock(&mtx1);
sleep(1);
printf("3\n");
mutex_unlock(&mtx1);
return NULL;
}
param.sched_priority = (THREADCNT - 2) * 10;
pthread_attr_setschedparam(&attr, ¶m);
pthread_create(&th[2], NULL, task3, (void *)NULL);
param.sched_priority = (THREADCNT - 1) * 10;
pthread_attr_setschedparam(&attr, ¶m);
pthread_create(&th[1], NULL, task2, (void *)NULL);
param.sched_priority = (THREADCNT) * 10;
pthread_attr_setschedparam(&attr, ¶m);
pthread_create(&th[0], NULL, task1, (void *)NULL);
```
其中,thread 0 有最高的 priority。然而,實驗結果卻顯示 thread 2 先行完成:
```
$ ./pi-test
3
1
2
```
#### 實做 priority inheritance mutex
參考 [futex(2)](https://man7.org/linux/man-pages/man2/futex.2.html) 中的 priority inheritance mutex 段落進行實做。
[46099fd](https://github.com/Cuda-Chen/linux-2023-summer/commit/46099fd621e59988943b5b0c2d16688b53913baa)
在這個 commit 實做 PI-mutex,不過仍然遇到 priority inversion:
```
$ ./pi-test
3
1
2
```
### 測驗 `1` - 4
(待補)
首先先把執行 `make` 並且執行三個編譯出來的執行檔:
```
$ ./perf-pthreads
Measuring Locking and unlocking without contention: best 8ns, 50%ile 8ns
Measuring Locking and unlocking with contention: best 3449ns, 50%ile 5225ns
$ ./perf-skinny
Measuring Locking and unlocking without contention: best 13ns, 50%ile 13ns
Measuring Locking and unlocking with contention: best 4659ns, 50%ile 6178ns
$ ./perf-spinlock
Measuring Locking and unlocking without contention: best 9ns, 50%ile 9ns
Measuring Locking and unlocking with contention: best 97ns, 50%ile 141ns
```
### code may improve
```c
static inline void spin_lock(spinlock_t *lock)
{
/* Given the specific implementation of spin_trylock(), this results in
* a test-and-test-and-set (TTAS) loop. Refer to
* https://rigtorp.se/spinlock/ for more information.
*/
while (!spin_trylock(lock))
spin_hint();
}
```
## MISC
### WINE
- https://www.facebook.com/groups/598186762133482/search/?q=wine
- https://hackmd.io/@taipeiTech/H1LzOsFSB/https%3A%2F%2Fhackmd.io%2F%40sysprog%2Flinux-comic#Wine-%E8%A1%8C%E7%A8%8B
### futex
- https://eli.thegreenplace.net/2018/basics-of-futexes/
- https://github.com/weirddan455/futex-test
- https://lore.kernel.org/lkml/20210709001328.329716-1-andrealmeid@collabora.com/
### futex for MS gamming
- https://www.collabora.com/news-and-blog/blog/2023/02/17/the-futex-waitv-syscall-gaming-on-linux/
- https://www.phoronix.com/news/FUTEX2-v5
### futex2
- https://docs.kernel.org/userspace-api/futex2.html
### PI-futex
- https://hackmd.io/6FOQGjztRx29BnDt2OnzSw?view
- https://docs.kernel.org/locking/rt-mutex-design.html
- lockdep.c (172K)
### linux workqueue.c
- `try_to_get_pending`
### N/A
- https://www.hillelwayne.com/post/safety-and-liveness/