epoll / file descriptor

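---
###### tags: `linux2022`
---

# epoll / file descriptor
contributed by < `eric88525` >

Quoting epoll(7):
> The epoll API performs a similar task to poll(2): monitoring multiple file descriptors to see if I/O is possible on any of them. The epoll API can be used either as an edge-triggered or a level-triggered interface and scales well to large numbers of watched file descriptors.

# Overview
+ Conceptually, you ask the kernel to watch a set of fds on your behalf and, when needed, simply collect the fds that are ready
+ epoll stands for "event poll" and is a Linux-specific facility
+ It lets a process monitor multiple file descriptors and be notified when I/O is possible on any of them (edge-triggered or level-triggered)
+ epoll itself is not a system call but a kernel **data structure**, which is operated on through the `epoll_create`, `epoll_ctl` and `epoll_wait` system calls

![](https://i.imgur.com/HZLqNqA.png)

# epoll syntax
## 1. epoll_create
Creates an epoll instance. This system call returns a file descriptor that refers to the epoll instance.
+ Parameters
    + **size**: originally a hint for how many file descriptors the process expects to monitor. Since Linux 2.6.8 the value is ignored (the kernel sizes its structures dynamically), but it must still be greater than zero.

```c
#include <sys/epoll.h>
int epoll_create(int size);
```

![](https://i.imgur.com/OOQ9YPQ.jpg)

There is also a second form, `epoll_create1(int flags)`, where flags is either `0` or `EPOLL_CLOEXEC`. With `0` it behaves like `epoll_create`, minus the obsolete size argument.

```c
int epoll_create1(int flags);
```

With flags = `EPOLL_CLOEXEC`, the close-on-exec flag is set on the epoll descriptor: a forked child still inherits the descriptor, but it is closed automatically as soon as that process calls exec(), so the newly executed program cannot use the epoll instance.

## 2. epoll_ctl
A process uses this function to add, modify, or remove the file descriptors it wants to watch. The set of registered fds is called the `epoll set` or `interest list`.

![](https://i.imgur.com/BznWlzp.jpg)

The ready list is a subset of the interest list.

![](https://i.imgur.com/MDOciut.jpg)

Actual usage:

```c
#include <sys/epoll.h>
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
```

+ Parameters:
    + epfd: the file descriptor returned by `epoll_create()`, identifying the epoll instance
    + op: the operation, one of three
        + register: `EPOLL_CTL_ADD`
        + remove: `EPOLL_CTL_DEL`
        + modify: `EPOLL_CTL_MOD`
    + fd: the file descriptor to add to, remove from, or modify in the interest list (the ready list is maintained by the kernel)
    + event: pointer to an `epoll_event` structure
        + `epoll_event` contains **events** and **data**
        + events is a bitmask specifying which events to watch, such as `EPOLLIN` or `EPOLLOUT`, see the [event list](https://man7.org/linux/man-pages/man2/epoll_ctl.2.html)
        + data: of type `epoll_data`, user data that the kernel stores and hands back through `epoll_wait` when the fd becomes ready

```c
typedef union epoll_data {
    void *ptr;
    int fd;
    uint32_t u32;
    uint64_t u64;
} epoll_data_t;

struct epoll_event {
    uint32_t events;   /* Epoll events */
    epoll_data_t data; /* User data variable */
};
```

![](https://i.imgur.com/R0s5K4h.jpg)

## 3. epoll_wait
Through the epoll_wait system call, a thread learns which events have fired for the fds in the epoll set / interest list.

```c
#include <sys/epoll.h>
int epoll_wait(int epfd, struct epoll_event *evlist, int maxevents, int timeout);
```

| name | description |
| -------- | -------- |
| epfd | the file descriptor returned by `epoll_create()`, identifying the epoll instance |
| evlist | array of `epoll_event`, filled in by `epoll_wait` to report which fds are in the ready list |
| maxevents | length of evlist (must be greater than zero) |
| timeout | maximum time to block (ms) |

**timeout**
+ 0: return immediately after checking, without blocking
+ -1: block (sleep) indefinitely until an event occurs or the call is interrupted by a signal
+ $>0$: return when an event occurs or when the timeout expires, whichever comes first

**return value**
+ -1: an error occurred and `errno` is set, see [error codes](https://www.gnu.org/software/libc/manual/html_node/Error-Codes.html)
+ 0: the call timed out with no fd in the ready list
+ $>0$: at least one fd is in the ready list; the value is the number of file descriptors ready for the requested I/O, and evlist can then be inspected to see which fds have events

# epoll pitfalls
## File descriptors
epoll is tightly coupled to fds, so we need to understand fds first.

A process refers to its I/O streams through file descriptors. Every process has its own fd table with two fields, **flag** and **pointer**. The only flag currently defined is `close on exec`, which is covered below.

![](https://i.imgur.com/hC3IbxU.png)

A descriptor can be created by system calls such as open, pipe, and socket, or inherited through **fork**. A descriptor is **closed** when the process exits or calls close. There is one more case: a descriptor marked `close on exec` is inherited by the child on fork, but it is closed automatically when the child calls exec(), so only the parent keeps using it.

Process B is forked from A. Once B calls exec(), the descriptor is **marked inactive and can no longer be used** in B.

![](https://i.imgur.com/guFFB27.png)

The kernel also maintains an open file table that records every open file. It holds one entry per open() call, so a file opened by two processes has two entries.

![](https://i.imgur.com/t3wPqKh.png)

When a process is forked, its descriptors are copied and point to the same open file table entries. If one process changes the offset, every copied descriptor sees the change, because they share the entry.

![](https://i.imgur.com/TrYL4lU.png)

## The inode table
An inode is a file system data structure that records the **metadata** of a file system object, including:
+ location: which disk blocks (or which device) hold the data
+ attributes of the file or directory
+ additional metadata such as access time, owner, and permissions
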
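
The metadata listed above can be inspected from user space with `fstat(2)`. The following is only a minimal sketch: the function names and the `/tmp/inode_demo_XXXXXX` path are made up for illustration. It opens the same temporary file twice and checks that the two distinct descriptors resolve to the same inode.

```c
#define _GNU_SOURCE /* makes sure mkstemp() is declared under strict -std modes */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Print a few inode fields for an open descriptor; returns 0 on success. */
static int print_inode_info(int fd)
{
    struct stat st;

    if (fstat(fd, &st) < 0)
        return -1;
    printf("inode=%llu mode=%o uid=%u blocks=%lld atime=%lld\n",
           (unsigned long long)st.st_ino, (unsigned)st.st_mode,
           (unsigned)st.st_uid, (long long)st.st_blocks,
           (long long)st.st_atime);
    return 0;
}

/* Hypothetical helper: opens one temporary file twice.
 * Returns 1 if both descriptors refer to the same inode,
 * 0 if they do not, and -1 on error. */
int same_inode_for_two_opens(void)
{
    char path[] = "/tmp/inode_demo_XXXXXX"; /* illustrative path template */
    struct stat s1, s2;
    int rc = -1;

    int fd1 = mkstemp(path);         /* first open: creates the file */
    if (fd1 < 0)
        return -1;
    int fd2 = open(path, O_RDONLY);  /* second, independent open */
    if (fd2 < 0) {
        close(fd1);
        unlink(path);
        return -1;
    }
    assert(fd1 != fd2);              /* two distinct descriptors */

    if (fstat(fd1, &s1) == 0 && fstat(fd2, &s2) == 0) {
        print_inode_info(fd1);
        /* An inode is identified by the (device, inode number) pair. */
        rc = (s1.st_dev == s2.st_dev && s1.st_ino == s2.st_ino);
    }

    close(fd1);
    close(fd2);
    unlink(path);
    return rc;
}
```

Calling `same_inode_for_two_opens()` from a small `main` should return 1: the descriptor numbers differ, but the inode behind them is the same.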
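
Every file in a file system has an **inode**, identified by its **inode number**. The inode table records the mapping from inode numbers to inode structures. In the figure below, Process A opens `abc.txt` and gets **fd5**, and Process B opens the same file and gets **fd10**. They point to different entries in the open file table, but both ultimately refer to the same file.

![](https://i.imgur.com/mRQ8c37.png)

# epoll internals
Below, process A has two different files open.

![](https://i.imgur.com/7dFU0kv.png)

Process A calls `epoll_create` to create an epoll instance, and **fd9** becomes its file descriptor.

![](https://i.imgur.com/X6esq6N.png)

Through `epoll_ctl`, the fds to watch are registered; here **fd0** is added to the interest list.

![](https://i.imgur.com/bGftvmO.png)

If Process B is forked at this point, B inherits A's fd table, and even **fd9** is shared. Whatever Process A adds to the interest list, Process B is notified about as well.

![](https://i.imgur.com/tfOqZaF.png)

Even if one process closes a descriptor that epoll is watching, notifications keep arriving. The kernel removes an entry from the interest list only after every descriptor referring to the same open file description has been closed, and after a fork the other process still holds a copy.

# Comparison with select / poll
select / poll have $O(N)$ complexity: every check scans the whole set. For a web server, that means checking every client each time.

With epoll, you only call `epoll_wait`, and everything it returns is an fd on which an event has occurred.

As a result, the cost of epoll is O(number of events that have occurred) and not O(number of descriptors being monitored) as was the case with select/poll.
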
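
To make the comparison concrete, here is a small sketch that goes through `epoll_create1`, `epoll_ctl`, and `epoll_wait` using pipes. The function name `wait_for_hot_pipe`, the constant `NPIPES`, and the choice of pipes as the watched descriptors are assumptions made for illustration. Several read ends are registered, only one pipe receives data, and `epoll_wait` reports just that one descriptor without the caller scanning the rest.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

#define NPIPES 4 /* hypothetical number of watched descriptors */

/* Registers the read ends of NPIPES pipes, makes only pipe `hot` readable,
 * and returns the number of events epoll_wait reports (-1 on error).
 * On success, *ready_index holds the index of the reported pipe. */
int wait_for_hot_pipe(int hot, int *ready_index)
{
    int pipes[NPIPES][2];
    struct epoll_event evlist[NPIPES];
    int made = 0, n = -1, ok = 1;

    assert(hot >= 0 && hot < NPIPES);

    int epfd = epoll_create1(0);
    if (epfd < 0)
        return -1;

    for (int i = 0; i < NPIPES && ok; i++) {
        if (pipe(pipes[i]) < 0) {
            ok = 0;
            break;
        }
        made++;
        struct epoll_event ev = { .events = EPOLLIN };
        ev.data.u32 = (uint32_t)i; /* user data handed back by epoll_wait */
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipes[i][0], &ev) < 0)
            ok = 0;
    }

    /* Only one of the watched descriptors becomes readable. */
    if (ok && write(pipes[hot][1], "x", 1) == 1) {
        n = epoll_wait(epfd, evlist, NPIPES, 0); /* timeout 0: do not block */
        if (n > 0)
            *ready_index = (int)evlist[0].data.u32;
    }

    for (int i = 0; i < made; i++) {
        close(pipes[i][0]);
        close(pipes[i][1]);
    }
    close(epfd);
    return n;
}
```

With `hot = 2`, the call should return 1 and set `*ready_index` to 2: the three idle pipes never appear in `evlist`, whereas a select/poll loop would have had to examine all four.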
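# Level-triggered / edge-triggered
+ Level-triggered: an I/O event is produced whenever the condition holds
+ Edge-triggered: an I/O event is produced when the state changes

By default epoll provides **level-triggered notifications**: every call to `epoll_wait` returns only the ready list. In the figure below, only [fd2, fd3] are returned.

![](https://i.imgur.com/SFzDBJJ.jpg)

Sometimes, though, we want to be told only when the state of an fd changes, rather than every time it happens to be ready. In other words, we want **edge-triggered notifications**. This is requested by OR-ing `EPOLLET` into the events bitmask, as in the following Lua snippet.

```lua
function Poller:register(fd, r, w)
  local ev = self.ev[0]
  -- EPOLLET requests edge-triggered notifications for this fd
  ev.events = bit.bor(C.EPOLLET, C.EPOLLERR, C.EPOLLHUP)
  if r then
    ev.events = bit.bor(ev.events, C.EPOLLIN)   -- also watch for readability
  end
  if w then
    ev.events = bit.bor(ev.events, C.EPOLLOUT)  -- also watch for writability
  end
  ev.data.u64 = fd
  local rc = C.epoll_ctl(self.fd, C.EPOLL_CTL_ADD, fd, ev)
  if rc < 0 then
    errors.get(rc):abort()
  end
end
```

+ `edge trigger` vs `level trigger`:
    + level trigger: focuses on the **condition (ready)**. As long as the fd is ready, you keep being notified on **every** call.
    + edge trigger: you are notified **once**, only when the **state changes**.

> Sources
> [The method to epoll’s madness](https://copyconstruct.medium.com/the-method-to-epolls-madness-d9d2d6378642)
> [边缘触发(Edge Trigger)和条件触发(Level Trigger)](https://blog.csdn.net/josunna/article/details/6269235)
> [深入理解 Linux 的 epoll 機制](https://www.readfog.com/a/1641834490361909248)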
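
The level-triggered versus edge-triggered difference described above can also be observed directly in C. The sketch below is an illustration under assumptions: `second_wait_result` is a made-up helper name, and a pipe stands in for whatever descriptor is being watched. One byte is written and never read, then `epoll_wait` is called twice.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Returns how many events the SECOND epoll_wait call reports while one
 * unread byte sits in a pipe, or -1 on error.
 * extra_flags: 0 for level-triggered, EPOLLET for edge-triggered. */
int second_wait_result(uint32_t extra_flags)
{
    int p[2], epfd, first, second = -1;
    struct epoll_event ev = { .events = EPOLLIN | extra_flags };
    struct epoll_event out;

    assert(extra_flags == 0 || extra_flags == EPOLLET);

    if (pipe(p) < 0)
        return -1;
    epfd = epoll_create1(0);
    if (epfd < 0)
        goto close_pipe;

    ev.data.fd = p[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) < 0)
        goto close_all;
    if (write(p[1], "x", 1) != 1) /* the read end becomes readable */
        goto close_all;

    /* The first call reports the new data in both modes. */
    first = epoll_wait(epfd, &out, 1, 0);
    if (first == 1)
        /* The byte is still unread: level-triggered reports it again,
         * edge-triggered stays silent until new data arrives. */
        second = epoll_wait(epfd, &out, 1, 0);

close_all:
    close(epfd);
close_pipe:
    close(p[0]);
    close(p[1]);
    return second;
}
```

`second_wait_result(0)` should return 1 and `second_wait_result(EPOLLET)` should return 0. This is why edge-triggered code usually uses non-blocking descriptors and keeps reading until `EAGAIN` before calling `epoll_wait` again.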