
2021q3 Homework3 (vpoll)

contributed by < RoyWFHuang >

tags: linux_class kernel

Summer 2021, week 4 in-class quiz
Summer 2021 Linux kernel Homework3

Development environment:
OS: ubuntu 20.04
kernel ver: 5.4.0-77-generic
CPU arch: x86_64
Developed inside a multipass VM

$ multipass info dev
Name:           dev
State:          Running
IPv4:           10.195.227.125
Release:        Ubuntu 20.04.2 LTS
Image hash:     1d3f69a8984a (Ubuntu 20.04 LTS)
Load:           0.00 0.00 0.00
Disk usage:     1.9G out of 4.7G
Memory usage:   136.6M out of 981.2M
Mounts:         /home/roy/src-code => /home/roy/src-code
                    UID map: 1000:default
                    GID map: 1000:default

In multipass shell
$ uname -a
Linux dev 5.4.0-80-generic #90-Ubuntu

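Explain how the code above works, and discuss the related memory ordering issues

The biggest problem I ran into with this question was which of the following two snippets to choose: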

if (waitqueue_active(&vpoll_data->wqh))
    wake_up_poll(&vpoll_data->wqh, vpoll_data->events);

or

if (waitqueue_active(&vpoll_data->wqh))
    wake_up_locked_poll(&vpoll_data->wqh, vpoll_data->events);

For the program to work correctly, wake_up_locked_poll must be chosen.
First, look at how the two differ. Both end up calling __wake_up_common:
wake_up_locked_poll -> __wake_up_locked_key((x), TASK_NORMAL, poll_to_key(m)) -> __wake_up_common(wq_head, mode, 1, 0, key, NULL)

list_for_each_entry_safe_from(curr, next, &wq_head->head, entry) {
    unsigned flags = curr->flags;
    int ret;

    if (flags & WQ_FLAG_BOOKMARK)
        continue;

    ret = curr->func(curr, mode, wake_flags, key);
    if (ret < 0)
        break;
    if (ret && (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
        break;

    if (bookmark && (++cnt > WAITQUEUE_WALK_BREAK_CNT) &&
            (&next->entry != &wq_head->head)) {
        bookmark->flags = WQ_FLAG_BOOKMARK;
        list_add_tail(&bookmark->entry, &next->entry);
        break;
    }
}

wake_up_poll -> __wake_up(x, TASK_NORMAL, 1, poll_to_key(m)) -> __wake_up_common_lock(wq_head, mode, nr_exclusive, 0, key) -> __wake_up_common(wq_head, mode, nr_exclusive, wake_flags, key, &bookmark)

do {
    spin_lock_irqsave(&wq_head->lock, flags);
    nr_exclusive = __wake_up_common(wq_head, mode, nr_exclusive,
                                    wake_flags, key, &bookmark);
    spin_unlock_irqrestore(&wq_head->lock, flags);
} while (bookmark.flags & WQ_FLAG_BOOKMARK);

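The difference is clearly the locking: wake_up_poll takes wq_head->lock itself, while wake_up_locked_poll expects the caller to hold it. The original program has already taken the lock once with spin_lock_irq(&vpoll_data->wqh.lock); taking the same lock again inside wake_up_poll would be a re-lock and the CPU would spin on a lock it already holds. So I removed the lock from vpoll_ioctl, switched to wake_up_poll, and tested again: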

./user
timeout...
(1)GOT event 1
timeout...
(1)GOT event 1
timeout...
(1)GOT event 3
(1)GOT event 2
timeout...
(1)GOT event 4
timeout...
(1)GOT event 10

It clearly works.
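The locked versus self-locking calling convention discussed above can be replayed in user space. The following is a minimal sketch, not kernel code: `struct fake_wq`, `fake_trylock`, `fake_wake_up` and `fake_wake_up_locked` are hypothetical names, and a C11 `atomic_flag` stands in for the spinlock so that a re-lock is reported instead of hanging.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space stand-in for wait_queue_head_t. */
struct fake_wq {
    atomic_flag lock;   /* toy spinlock: set means "held" */
    int wakeups;        /* how many wake-ups were delivered */
};

static bool fake_trylock(struct fake_wq *wq)
{
    /* test_and_set returns the previous state: true means already held. */
    return !atomic_flag_test_and_set(&wq->lock);
}

static void fake_unlock(struct fake_wq *wq)
{
    atomic_flag_clear(&wq->lock);
}

/* Mirrors wake_up_locked_poll(): the caller must already hold the lock. */
static void fake_wake_up_locked(struct fake_wq *wq)
{
    wq->wakeups++;
}

/*
 * Mirrors wake_up_poll(): takes the lock itself. A real spinlock would
 * spin forever if the caller already held it; this sketch returns false.
 */
static bool fake_wake_up(struct fake_wq *wq)
{
    if (!fake_trylock(wq))
        return false;
    fake_wake_up_locked(wq);
    fake_unlock(wq);
    return true;
}

/* Replays both call patterns and returns the number of wake-ups delivered. */
static int replay_wakeups(void)
{
    struct fake_wq wq = { .lock = ATOMIC_FLAG_INIT, .wakeups = 0 };

    /* Pattern 1: lock held by the caller, as in the original vpoll_ioctl. */
    bool held = fake_trylock(&wq);
    bool relock_ok = fake_wake_up(&wq);   /* self-locking variant is refused */
    fake_wake_up_locked(&wq);             /* locked variant is the right call */
    fake_unlock(&wq);

    /* Pattern 2: lock not held, so the self-locking variant works. */
    bool plain_ok = fake_wake_up(&wq);

    assert(held && !relock_ok && plain_ok);
    return wq.wakeups;
}
```

Calling `replay_wakeups()` from a `main` delivers one wake-up per pattern. The point of the sketch is only the pairing: hold the lock and call the `_locked` variant, or drop the lock and call the self-locking one.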

  1. What changes between the two versions:
switch (cmd) {
case VPOLL_IO_ADDEVENTS:
    vpoll_data->events |= events;
    break;
case VPOLL_IO_DELEVENTS:
    vpoll_data->events &= ~events;
    break;
default:
    res = -EINVAL;
}
if (res >= 0) {
    res = vpoll_data->events;
    if (waitqueue_active(&vpoll_data->wqh))
        wake_up_poll(&vpoll_data->wqh, vpoll_data->events);
}

User space can call ioctl from several tasks at the same time, so vpoll_ioctl may run concurrently:

task 1                                      task 2
ioctl(efd, VPOLL_IO_ADDEVENTS, EPOLLIN);    ioctl(efd, VPOLL_IO_ADDEVENTS, EPOLLPRI);
vpoll_data->events |= events;
                                            vpoll_data->events |= events;
vpoll_data->events == ???                   vpoll_data->events == ???

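Without the lock, |= is a non-atomic read-modify-write, so one task's update can overwrite the other's and events ends up with wrong data.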
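The interleaving in the table above can be replayed deterministically in user space. This is a sketch under assumptions: `EV_IN` and `EV_PRI` are local names carrying the values of EPOLLIN and EPOLLPRI, and C11 `atomic_fetch_or` is used as a user-space stand-in for an atomic update. In the module itself, keeping the update under wqh.lock as the original code did would be another option.

```c
#include <assert.h>
#include <stdatomic.h>

#define EV_IN  0x001u   /* same value as EPOLLIN  */
#define EV_PRI 0x002u   /* same value as EPOLLPRI */

/* Replays the bad interleaving by hand: both tasks load before either stores. */
static unsigned lost_update_replay(void)
{
    unsigned events = 0;
    unsigned t1 = events;   /* task 1 loads 0 */
    unsigned t2 = events;   /* task 2 loads 0 */

    events = t1 | EV_IN;    /* task 1 stores 0x1 */
    events = t2 | EV_PRI;   /* task 2 stores 0x2, EV_IN is lost */
    return events;
}

/* With an atomic read-modify-write, no update can slip in between load and store. */
static unsigned atomic_update(void)
{
    atomic_uint events = 0;

    atomic_fetch_or(&events, EV_IN);    /* task 1 */
    atomic_fetch_or(&events, EV_PRI);   /* task 2 */
    return atomic_load(&events);
}

/* Self-check that can be called from main(). */
static void check_race_demo(void)
{
    assert(lost_update_replay() == EV_PRI);
    assert(atomic_update() == (EV_IN | EV_PRI));
}
```

The replay is sequential on purpose, so the lost update shows up every time rather than only when the scheduler happens to cooperate.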

  1. unlocked_ioctl
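    While tracing the code, I saw that user space goes through ioctl to reach the chrdev's unlocked_ioctl, which handles the events. That made me curious: why is it called unlocked_ioctl, and does that mean there is also a locked ioctl? Looking at linux/fs.h: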
.....
__poll_t (*poll) (struct file *, struct poll_table_struct *);
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *);
unsigned long mmap_supported_flags;
.....

Only unlocked_ioctl and compat_ioctl are there. I then found the question "What is the difference between ioctl(), unlocked_ioctl() and compat_ioctl()?", and one of its answers led me to "The new way of ioctl()":

ioctl() is one of the remaining parts of the kernel which runs under the Big Kernel Lock (BKL). In the past, the usage of the BKL has made it possible for long-running ioctl() methods to create long latencies for unrelated processes.

So it is the BKL again.

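Modify the vpoll kernel module to implement a special mode for performance evaluation, and use it to analyze epoll performance (see Epoll Kernel Performance Improvements and linux-ipc-benchmarks)
  1. Remove the need for user space to call VPOLL_IO_DELEVENTS to clear events

Modify vpoll_poll so that, after the events have been read out, the stored events are cleared: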

static __poll_t vpoll_poll(struct file *file, struct poll_table_struct *wait)
{
    struct vpoll_data *vpoll_data = file->private_data;
    __poll_t events;

    poll_wait(file, &vpoll_data->wqh, wait);
    /* Take a snapshot of the pending events, then clear exactly those bits. */
    events = READ_ONCE(vpoll_data->events);
    vpoll_data->events &= ~events;
    return events;
}

Output:

timeout...
(1)GOT event 1
(1)GOT event 1
(1)GOT event 3
(1)GOT event 2
timeout...
(1)GOT event 4
(1)GOT event 10
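The read-then-clear step in the modified vpoll_poll is two separate operations, so an event posted between them could be handled oddly. One way to think about closing that gap is a single atomic exchange. The sketch below is user-space C11, not the module's code: `pending_events`, `post_events` and `take_events` are hypothetical names, and whether the same idea suits the kernel's poll path (which may be invoked more than once per wake-up) is left open.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for vpoll_data->events. */
static atomic_uint pending_events;

/* Producer side: OR new bits into the pending mask. */
static void post_events(unsigned mask)
{
    atomic_fetch_or(&pending_events, mask);
}

/* Consumer side: return every pending bit and clear them in one atomic step. */
static unsigned take_events(void)
{
    return atomic_exchange(&pending_events, 0u);
}

/* Self-check that can be called from main(). */
static void check_take_events(void)
{
    post_events(0x1u);
    post_events(0x2u);

    unsigned first = take_events();    /* both bits, taken together */
    unsigned second = take_events();   /* nothing left afterwards */

    assert(first == 0x3u);
    assert(second == 0u);
}
```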
  1. Remove sleep(1) from the child process

This gives the following result:

(1)GOT event 1
(1)GOT event 17

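0x17 is clearly the result of 0x01 | 0x02 | 0x04 | 0x10. Although the wait queue was told to wake up, the child process was in fact still running and kept adding events, so they piled up into a single mask.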
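The arithmetic can be checked directly. In this small sketch the enum names are local and only assumed to correspond to what the child process posts; the values themselves are those of EPOLLIN, EPOLLPRI, EPOLLOUT and EPOLLHUP in <sys/epoll.h>.

```c
#include <assert.h>

/* Local names; values match EPOLLIN, EPOLLPRI, EPOLLOUT and EPOLLHUP. */
enum { EV_IN = 0x01, EV_PRI = 0x02, EV_OUT = 0x04, EV_HUP = 0x10 };

/* What accumulates when every posted event is ORed into one mask. */
static unsigned merged_events(void)
{
    return EV_IN | EV_PRI | EV_OUT | EV_HUP;
}

/* ANDing distinct single-bit flags leaves nothing, so it cannot explain 0x17. */
static unsigned anded_events(void)
{
    return EV_IN & EV_PRI & EV_OUT & EV_HUP;
}

/* Self-check that can be called from main(). */
static void check_masks(void)
{
    assert(merged_events() == 0x17u);
    assert(anded_events() == 0u);
}
```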

The fix is to use a ring buffer so that each event is recorded separately.

Modify the structure:

struct vpoll_data {
    wait_queue_head_t wqh;
    // __poll_t events;
    __poll_t events[MAX_SZ];
    int head, tail;
};

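Modify vpoll_ioctl. The lock scope is also narrowed here. The queue-full check simply reads the slot: because the reader clears an event after consuming it, it is enough to check whether the slot at tail still holds an event.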

static long vpoll_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
...
    spin_lock_irq(&vpoll_data->wqh.lock);
    /* Queue full: the slot at tail has not been consumed yet. */
    if (0 != vpoll_data->events[vpoll_data->tail]) {
        spin_unlock_irq(&vpoll_data->wqh.lock);
        return -EINVAL;
    }
    idx = vpoll_data->tail;
    /* Wrap around so that events[] is really used as a ring. */
    vpoll_data->tail = (vpoll_data->tail + 1) % MAX_SZ;
    spin_unlock_irq(&vpoll_data->wqh.lock);

    switch (cmd) {
    case VPOLL_IO_ADDEVENTS:
        vpoll_data->events[idx] |= events;
        break;
    case VPOLL_IO_DELEVENTS:
        vpoll_data->events[idx] &= ~events;
        break;
    default:
        res = -EINVAL;
    }
    if (res >= 0) {
        // res = vpoll_data->events[vpoll_data->tail];
        if (waitqueue_active(&vpoll_data->wqh)) {
            // WWW(&vpoll_data->wqh, vpoll_data->events);
            wake_up_poll(&vpoll_data->wqh, events);
            // wake_up_locked_poll(&vpoll_data->wqh, vpoll_data->events);
        }
    }
...

Modify vpoll_poll. The queue-empty check works like the full check: test whether an event has been written into the slot at head.

static __poll_t vpoll_poll(struct file *file, struct poll_table_struct *wait)
{
    struct vpoll_data *vpoll_data = file->private_data;
    __poll_t events;

    poll_wait(file, &vpoll_data->wqh, wait);
    events = READ_ONCE(vpoll_data->events[vpoll_data->head]);
    /* A non-zero slot means an event was written and not yet consumed. */
    if (0 != events) {
        vpoll_data->events[vpoll_data->head] &= ~events;
        /* Wrap around to match the producer side. */
        vpoll_data->head = (vpoll_data->head + 1) % MAX_SZ;
    }
    return events;
}
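The slot-based full and empty checks can be tried out in user space. The following single-threaded sketch assumes a capacity of 4 for `MAX_SZ` (the value is not given above) and uses hypothetical names `struct ev_ring`, `ring_push` and `ring_pop`; locking and wake-ups are left out on purpose.

```c
#include <assert.h>

#define MAX_SZ 4   /* assumed capacity, for illustration only */

struct ev_ring {
    unsigned events[MAX_SZ];   /* 0 means "slot is empty" */
    int head, tail;
};

/* Producer: returns 0 on success, -1 if the slot at tail is still occupied. */
static int ring_push(struct ev_ring *r, unsigned ev)
{
    if (ev == 0 || r->events[r->tail] != 0)
        return -1;
    r->events[r->tail] = ev;
    r->tail = (r->tail + 1) % MAX_SZ;
    return 0;
}

/* Consumer: returns the oldest event mask, or 0 if the ring is empty. */
static unsigned ring_pop(struct ev_ring *r)
{
    unsigned ev = r->events[r->head];

    if (ev != 0) {
        r->events[r->head] = 0;               /* mark the slot free again */
        r->head = (r->head + 1) % MAX_SZ;
    }
    return ev;
}

/* Self-check that can be called from main(). */
static void check_ring(void)
{
    struct ev_ring r = { .head = 0, .tail = 0 };
    int rc1 = ring_push(&r, 0x01);
    int rc2 = ring_push(&r, 0x02);
    int rc3 = ring_push(&r, 0x04);
    int rc4 = ring_push(&r, 0x10);
    int rc_full = ring_push(&r, 0x03);   /* all MAX_SZ slots are occupied */
    unsigned first = ring_pop(&r);
    unsigned second = ring_pop(&r);

    assert(rc1 == 0 && rc2 == 0 && rc3 == 0 && rc4 == 0 && rc_full == -1);
    assert(first == 0x01 && second == 0x02);   /* delivered one by one */
}
```

Because 0 doubles as the empty marker, an event mask of 0 cannot be queued. That matches the check used in vpoll_ioctl and vpoll_poll, where a zero slot is treated as unused.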

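Postscript:
Originally I wanted to start from the "events" parameter of wake_up_poll and see whether it could be retrieved directly in poll_wait. It seems, however, that this key only serves as an event notification and cannot be carried over together with the wait queue.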
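One possible reading of the __wake_up_common loop quoted earlier is that the key is handed to each wait entry's func callback, not to the driver's poll method, which may be why it is not visible from vpoll_poll. The sketch below models only that hand-off in user space; `struct wq_entry`, `record_key` and `wake_all` are hypothetical names and the list is a plain singly linked list, not the kernel's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct wq_entry;
typedef int (*wq_func_t)(struct wq_entry *entry, void *key);

/* Hypothetical, simplified wait queue entry. */
struct wq_entry {
    wq_func_t func;           /* callback run by the waker */
    uintptr_t seen_key;       /* last key this entry was woken with */
    struct wq_entry *next;
};

/* Example callback: remembers the key and reports that it woke up. */
static int record_key(struct wq_entry *entry, void *key)
{
    entry->seen_key = (uintptr_t)key;
    return 1;
}

/* Walk the list and pass the same key to every entry's callback. */
static int wake_all(struct wq_entry *head, void *key)
{
    int woken = 0;

    for (struct wq_entry *e = head; e != NULL; e = e->next)
        woken += e->func(e, key);
    return woken;
}

/* Self-check that can be called from main(). */
static void check_key_delivery(void)
{
    struct wq_entry second = { record_key, 0, NULL };
    struct wq_entry first = { record_key, 0, &second };
    int woken = wake_all(&first, (void *)(uintptr_t)0x3);

    assert(woken == 2);
    assert(first.seen_key == 0x3 && second.seen_key == 0x3);
}
```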