# 2021q3 Homework3 (vpoll)
contributed by < [`RoyWFHuang`](https://github.com/RoyWFHuang) >
###### tags: `linux_class` `kernel`
> [2021 年暑期 第 4 週隨堂測驗題目](https://hackmd.io/@sysprog/linux2021-summer-quiz4)
> [2021 年暑期 Linux 核心 Homework3](https://hackmd.io/@sysprog/linux2021-summer/https%3A%2F%2Fhackmd.io%2F%40sysprog%2FSyM7Y6e6u)
Development environment:
OS: ubuntu 20.04
kernel ver: 5.4.0-77-generic
CPU arch: x86_64
Development was done inside multipass.
```shell
$ multipass info dev
Name: dev
State: Running
IPv4: 10.195.227.125
Release: Ubuntu 20.04.2 LTS
Image hash: 1d3f69a8984a (Ubuntu 20.04 LTS)
Load: 0.00 0.00 0.00
Disk usage: 1.9G out of 4.7G
Memory usage: 136.6M out of 981.2M
Mounts: /home/roy/src-code => /home/roy/src-code
UID map: 1000:default
GID map: 1000:default
In multipass shell
$ uname -a
Linux dev 5.4.0-80-generic #90-Ubuntu
```
---
##### Explain how the code above works, and discuss the associated memory-order issues
The biggest question in this problem is which of the two snippets below to choose:
```c=
if (waitqueue_active(&vpoll_data->wqh))
wake_up_poll(&vpoll_data->wqh, vpoll_data->events);
```
or
```c=
if (waitqueue_active(&vpoll_data->wqh))
wake_up_locked_poll(&vpoll_data->wqh, vpoll_data->events);
```
For the program to work as-is, `wake_up_locked_poll` must be chosen.
Looking at the difference between the two first: both eventually end up calling ==__wake_up_common==.
[wake_up_locked_poll](https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h#L239) -> [__wake_up_locked_key((x), TASK_NORMAL, poll_to_key(m))](https://elixir.bootlin.com/linux/latest/source/kernel/sched/wait.c#L170) -> [__wake_up_common(wq_head, mode, 1, 0, key, NULL)](https://elixir.bootlin.com/linux/latest/source/kernel/sched/wait.c#L81)
```c=
list_for_each_entry_safe_from(curr, next, &wq_head->head, entry) {
unsigned flags = curr->flags;
int ret;
if (flags & WQ_FLAG_BOOKMARK)
continue;
ret = curr->func(curr, mode, wake_flags, key);
if (ret < 0)
break;
if (ret && (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
break;
if (bookmark && (++cnt > WAITQUEUE_WALK_BREAK_CNT) &&
(&next->entry != &wq_head->head)) {
bookmark->flags = WQ_FLAG_BOOKMARK;
list_add_tail(&bookmark->entry, &next->entry);
break;
}
}
```
[wake_up_poll](https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h#L237) -> [__wake_up(x, TASK_NORMAL, 1, poll_to_key(m))](https://elixir.bootlin.com/linux/latest/source/kernel/sched/wait.c#L154) -> [__wake_up_common_lock(wq_head, mode, nr_exclusive, 0, key);](https://elixir.bootlin.com/linux/latest/source/kernel/sched/wait.c#L125)-> [__wake_up_common(wq_head, mode, 1, 0, key, NULL)](https://elixir.bootlin.com/linux/latest/source/kernel/sched/wait.c#L81)
```c=
do {
spin_lock_irqsave(&wq_head->lock, flags);
nr_exclusive = __wake_up_common(wq_head, mode, nr_exclusive,
wake_flags, key, &bookmark);
spin_unlock_irqrestore(&wq_head->lock, flags);
} while (bookmark.flags & WQ_FLAG_BOOKMARK);
```
The difference in locking is obvious. In the original code, ==spin_lock_irq(&vpoll_data->wqh.lock);== has already taken the lock once; calling a locking variant here would acquire the same (non-recursive) wait-queue spinlock a second time, a re-lock problem. So after removing the outer lock, it was swapped to ==wake_up_poll== for testing:
```
./user
timeout...
(1)GOT event 1
timeout...
(1)GOT event 1
timeout...
(1)GOT event 3
(1)GOT event 2
timeout...
(1)GOT event 4
timeout...
(1)GOT event 10
```
Clearly, it works.
1. The difference between the two:
```c=
switch (cmd) {
case VPOLL_IO_ADDEVENTS:
vpoll_data->events |= events;
break;
case VPOLL_IO_DELEVENTS:
vpoll_data->events &= ~events;
break;
default:
res = -EINVAL;
}
if (res >= 0) {
res = vpoll_data->events;
if (waitqueue_active(&vpoll_data->wqh))
wake_up_poll(&vpoll_data->wqh, vpoll_data->events);
}
```
==vpoll_ioctl== can be entered concurrently, because user space may call ==ioctl== from several tasks at once:
| task 1| task 2 |
| -------- | -------- |
| ioctl(efd, VPOLL_IO_ADDEVENTS, EPOLLIN); | ioctl(efd, VPOLL_IO_ADDEVENTS, EPOLLPRI);|
| vpoll_data->events \|= events; | |
| | vpoll_data->events \|= events; |
| vpoll_data->events == ??? | vpoll_data->events == ??? |
This corrupts the `events` data: the two non-atomic `|=` read-modify-write sequences can interleave, so one update may be lost.
2. unlocked_ioctl
While tracing the code, I saw that user space goes through ioctl to invoke the chrdev's unlocked_ioctl for event handling. But why is it named unlocked_ioctl — does that imply a locked_ioctl also exists? Checking [linux/fs.h](https://elixir.bootlin.com/linux/latest/source/include/linux/fs.h#L2022):
```c=
.....
__poll_t (*poll) (struct file *, struct poll_table_struct *);
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *);
unsigned long mmap_supported_flags;
.....
```
There are only unlocked_ioctl and compat_ioctl, which led me to [What is the difference between ioctl(), unlocked_ioctl() and compat_ioctl()?](https://unix.stackexchange.com/questions/4711/what-is-the-difference-between-ioctl-unlocked-ioctl-and-compat-ioctl), and from the answers there to [The new way of ioctl()](https://lwn.net/Articles/119652/):
> ioctl() is one of the remaining parts of the kernel which runs under the Big Kernel Lock (BKL). In the past, the usage of the BKL has made it possible for long-running ioctl() methods to create long latencies for unrelated processes.
Ah, so it is the BKL again.
##### Modify the vpoll kernel module to implement a special benchmarking mode, then analyze epoll performance (see Epoll Kernel Performance Improvements and linux-ipc-benchmarks)
1. Remove the need for user space to call VPOLL_IO_DELEVENTS to clear events
Modify vpoll_poll so that after the events are read out, they are cleared from the structure:
```c=
static __poll_t vpoll_poll(struct file *file, struct poll_table_struct *wait)
{
struct vpoll_data *vpoll_data = file->private_data;
poll_wait(file, &vpoll_data->wqh, wait);
__poll_t events = READ_ONCE(vpoll_data->events);
vpoll_data->events &= ~events;
return events;
}
```
Output:
```
timeout...
(1)GOT event 1
(1)GOT event 1
(1)GOT event 3
(1)GOT event 2
timeout...
(1)GOT event 4
(1)GOT event 10
```
2. Remove the sleep(1) in the child process
This produces the following result:
```
(1)GOT event 1
(1)GOT event 17
```
0x17 is clearly 0x01 | 0x02 | 0x04 | 0x10: although the wait queue is woken, the child process is still running, so the events pile up (get OR-ed together) before the reader gets to consume them.
The fix is to use a ring buffer so that each event is recorded separately.
Modified structure:
```c=
struct vpoll_data {
wait_queue_head_t wqh;
// __poll_t events;
__poll_t events[MAX_SZ];
int head, tail;
};
```
Modify vpoll_ioctl. The lock scope is also narrowed here. For the queue-full check, the slot is read directly: since an event is cleared once the reader consumes it, a non-zero slot at `tail` means the queue is full.
```c=
static long vpoll_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
...
spin_lock_irq(&vpoll_data->wqh.lock);
    if (0 != vpoll_data->events[vpoll_data->tail]) {
        /* tail slot still holds an unconsumed event: queue is full */
        spin_unlock_irq(&vpoll_data->wqh.lock);
        return -EINVAL;
    }
    idx = vpoll_data->tail;
    vpoll_data->tail = (vpoll_data->tail + 1) % MAX_SZ; /* wrap around */
    spin_unlock_irq(&vpoll_data->wqh.lock);
switch (cmd) {
case VPOLL_IO_ADDEVENTS:
vpoll_data->events[idx] |= events;
break;
case VPOLL_IO_DELEVENTS:
vpoll_data->events[idx] &= ~events;
break;
default:
res = -EINVAL;
}
    if (res >= 0) {
        if (waitqueue_active(&vpoll_data->wqh))
            wake_up_poll(&vpoll_data->wqh, events);
    }
}
...
```
Modify vpoll_poll. The queue-empty check mirrors the full check: test whether an event has been written at `head`.
```c=
static __poll_t vpoll_poll(struct file *file, struct poll_table_struct *wait)
{
struct vpoll_data *vpoll_data = file->private_data;
poll_wait(file, &vpoll_data->wqh, wait);
__poll_t events = READ_ONCE(vpoll_data->events[vpoll_data->head]);
if (0 != events)
{
        vpoll_data->events[vpoll_data->head] &= ~events;
        vpoll_data->head = (vpoll_data->head + 1) % MAX_SZ; /* wrap around */
}
return events;
}
```
Postscript:
Originally I wanted to start from the `events` parameter of wake_up_poll, to see whether it could be extracted directly in poll_wait, but this key apparently serves only as a wakeup notification; it cannot be passed along through the wait queue.