# 2021q3 Homework3 (vpoll)

contributed by < [`RoyWFHuang`](https://github.com/RoyWFHuang) >

###### tags: `linux_class` `kernel`

> [2021 年暑期 第 4 週隨堂測驗題目](https://hackmd.io/@sysprog/linux2021-summer-quiz4)
> [2021 年暑期 Linux 核心 Homework3](https://hackmd.io/@sysprog/linux2021-summer/https%3A%2F%2Fhackmd.io%2F%40sysprog%2FSyM7Y6e6u)

Development environment:

OS: Ubuntu 20.04
Kernel: 5.4.0-77-generic
CPU arch: x86_64

Development is done inside a multipass VM:

```shell
$ multipass info dev
Name:           dev
State:          Running
IPv4:           10.195.227.125
Release:        Ubuntu 20.04.2 LTS
Image hash:     1d3f69a8984a (Ubuntu 20.04 LTS)
Load:           0.00 0.00 0.00
Disk usage:     1.9G out of 4.7G
Memory usage:   136.6M out of 981.2M
Mounts:         /home/roy/src-code => /home/roy/src-code
                    UID map: 1000:default
                    GID map: 1000:default

# In the multipass shell:
$ uname -a
Linux dev 5.4.0-80-generic #90-Ubuntu
```

---

##### Explain how the code above works, and discuss the corresponding memory-order issues

The biggest question here is which of the following two snippets to choose:

```c=
if (waitqueue_active(&vpoll_data->wqh))
    wake_up_poll(&vpoll_data->wqh, vpoll_data->events);
```

or

```c=
if (waitqueue_active(&vpoll_data->wqh))
    wake_up_locked_poll(&vpoll_data->wqh, vpoll_data->events);
```

For the program to run correctly, `wake_up_locked_poll` must be chosen.
Look at the difference between the two first; both eventually reach ==__wake_up_common==.

[wake_up_locked_poll](https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h#L239) -> [__wake_up_locked_key((x), TASK_NORMAL, poll_to_key(m))](https://elixir.bootlin.com/linux/latest/source/kernel/sched/wait.c#L170) -> [__wake_up_common(wq_head, mode, 1, 0, key, NULL)](https://elixir.bootlin.com/linux/latest/source/kernel/sched/wait.c#L81)

```c=
list_for_each_entry_safe_from(curr, next, &wq_head->head, entry) {
    unsigned flags = curr->flags;
    int ret;

    if (flags & WQ_FLAG_BOOKMARK)
        continue;

    ret = curr->func(curr, mode, wake_flags, key);
    if (ret < 0)
        break;
    if (ret && (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
        break;

    if (bookmark && (++cnt > WAITQUEUE_WALK_BREAK_CNT) &&
            (&next->entry != &wq_head->head)) {
        bookmark->flags = WQ_FLAG_BOOKMARK;
        list_add_tail(&bookmark->entry, &next->entry);
        break;
    }
}
```

[wake_up_poll](https://elixir.bootlin.com/linux/latest/source/include/linux/wait.h#L237) -> [__wake_up(x, TASK_NORMAL, 1, poll_to_key(m))](https://elixir.bootlin.com/linux/latest/source/kernel/sched/wait.c#L154) -> [__wake_up_common_lock(wq_head, mode, nr_exclusive, 0, key);](https://elixir.bootlin.com/linux/latest/source/kernel/sched/wait.c#L125) -> [__wake_up_common(wq_head, mode, 1, 0, key, NULL)](https://elixir.bootlin.com/linux/latest/source/kernel/sched/wait.c#L81)

```c=
do {
    spin_lock_irqsave(&wq_head->lock, flags);
    nr_exclusive = __wake_up_common(wq_head, mode, nr_exclusive,
                                    wake_flags, key, &bookmark);
    spin_unlock_irqrestore(&wq_head->lock, flags);
} while (bookmark.flags & WQ_FLAG_BOOKMARK);
```

The difference is clear: the `wake_up_poll` path acquires the wait-queue lock itself. In the original code, ==spin_lock_irq(&vpoll_data->wqh.lock);== has already taken that lock, so calling `wake_up_poll` inside the critical section would acquire the same spinlock a second time, which deadlocks, because kernel spinlocks are not recursive. After removing the surrounding lock, the call can be switched to ==wake_up_poll== for testing:

```
./user
timeout...
(1)GOT event 1
timeout...
(1)GOT event 1
timeout...
(1)GOT event 3
(1)GOT event 2
timeout...
(1)GOT event 4
timeout...
(1)GOT event 10
```

It clearly works.

1.
The difference between the two also matters for the critical section itself:

```c=
switch (cmd) {
case VPOLL_IO_ADDEVENTS:
    vpoll_data->events |= events;
    break;
case VPOLL_IO_DELEVENTS:
    vpoll_data->events &= ~events;
    break;
default:
    res = -EINVAL;
}
if (res >= 0) {
    res = vpoll_data->events;
    if (waitqueue_active(&vpoll_data->wqh))
        wake_up_poll(&vpoll_data->wqh, vpoll_data->events);
}
```

==vpoll_ioctl== can be entered concurrently, since user space may call ==ioctl== from several tasks at once:

| task 1 | task 2 |
| -------- | -------- |
| ioctl(efd, VPOLL_IO_ADDEVENTS, EPOLLIN); | ioctl(efd, VPOLL_IO_ADDEVENTS, EPOLLPRI);|
| vpoll_data->events \|= events; | |
| | vpoll_data->events \|= events; |
| vpoll_data->events == ??? | vpoll_data->events == ??? |

Because `|=` is a non-atomic read-modify-write, one task's update can be lost, corrupting the `events` state.

2. unlocked_ioctl

While tracing the code, I noticed that user space reaches the character device's `unlocked_ioctl` through `ioctl` to handle events. Why "unlocked" — does that imply a locked variant also exists? Checking [linux/fs.h](https://elixir.bootlin.com/linux/latest/source/include/linux/fs.h#L2022):

```c=
.....
__poll_t (*poll) (struct file *, struct poll_table_struct *);
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *);
unsigned long mmap_supported_flags;
.....
```

There are only `unlocked_ioctl` and `compat_ioctl`. That question led to [What is the difference between ioctl(), unlocked_ioctl() and compat_ioctl()?](https://unix.stackexchange.com/questions/4711/what-is-the-difference-between-ioctl-unlocked-ioctl-and-compat-ioctl), and from the answers there, to [The new way of ioctl()](https://lwn.net/Articles/119652/):

```
ioctl() is one of the remaining parts of the kernel which runs under the Big Kernel Lock (BKL). In the past, the usage of the BKL has made it possible for long-running ioctl() methods to create long latencies for unrelated processes.
```

So it is the Big Kernel Lock again: `unlocked_ioctl` is simply the ioctl entry point that runs without holding the BKL.

##### Modify the vpoll kernel module to implement a special benchmarking mode, and analyze epoll performance (see Epoll Kernel Performance Improvements and linux-ipc-benchmarks)

1.
Remove the need for user space to call VPOLL_IO_DELEVENTS to clear events.

Modify `vpoll_poll` so that after the events are read out, they are cleared in place:

```c=
static __poll_t vpoll_poll(struct file *file, struct poll_table_struct *wait)
{
    struct vpoll_data *vpoll_data = file->private_data;
    poll_wait(file, &vpoll_data->wqh, wait);
    __poll_t events = READ_ONCE(vpoll_data->events);
    vpoll_data->events &= ~events;
    return events;
}
```

Output:

```
timeout...
(1)GOT event 1
(1)GOT event 1
(1)GOT event 3
(1)GOT event 2
timeout...
(1)GOT event 4
(1)GOT event 10
```

2. Remove the sleep(1) in the child process.

The following result appears:

```
(1)GOT event 1
(1)GOT event 17
```

0x17 is clearly 0x01 | 0x02 | 0x04 | 0x10: the wait queue is woken, but the child process keeps running and ORs in more events before the reader consumes them, so the events pile up into a single mask.

The fix is to record each event separately in a ring buffer.

Modify the structure:

```c=
struct vpoll_data {
    wait_queue_head_t wqh;
    // __poll_t events;
    __poll_t events[MAX_SZ];
    int head, tail;
};
```

Modify `vpoll_ioctl`; the lock scope is also narrowed here. The queue-full test simply reads the slot directly: since a slot is cleared after the reader consumes it, a non-zero slot at `tail` means the queue is full.

```c=
static long vpoll_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
...
    spin_lock_irq(&vpoll_data->wqh.lock);
    if (0 != vpoll_data->events[vpoll_data->tail]) {
        spin_unlock_irq(&vpoll_data->wqh.lock);
        return -EINVAL;
    }
    idx = vpoll_data->tail;
    vpoll_data->tail++;
    spin_unlock_irq(&vpoll_data->wqh.lock);
    switch (cmd) {
    case VPOLL_IO_ADDEVENTS:
        vpoll_data->events[idx] |= events;
        break;
    case VPOLL_IO_DELEVENTS:
        vpoll_data->events[idx] &= ~events;
        break;
    default:
        res = -EINVAL;
    }
    if (res >= 0) {
        // res = vpoll_data->events[vpoll_data->tail];
        if (waitqueue_active(&vpoll_data->wqh))
            wake_up_poll(&vpoll_data->wqh, events);
    }
...
```

Modify `vpoll_poll`: the queue-empty test mirrors the full test, checking whether an event has been written at the `head` slot:

```c=
static __poll_t vpoll_poll(struct file *file, struct poll_table_struct *wait)
{
    struct vpoll_data *vpoll_data = file->private_data;
    poll_wait(file, &vpoll_data->wqh, wait);
    __poll_t events = READ_ONCE(vpoll_data->events[vpoll_data->head]);
    if (0 != events) {
        vpoll_data->events[vpoll_data->head] &= ~events;
        vpoll_data->head += 1;
    }
    return events;
}
```

Postscript: the original idea was to start from the "events" parameter of `wake_up_poll` and extract it directly in `poll_wait`, but that key appears to serve only as a wake-up notification filter; it cannot be carried through the wait queue to the waiter.