epoll / file descriptor

contributed by < eric88525 >

Quoting epoll(7):

The epoll API performs a similar task to poll(2): monitoring
multiple file descriptors to see if I/O is possible on any of
them. The epoll API can be used either as an edge-triggered or a
level-triggered interface and scales well to large numbers of
watched file descriptors.

Overview

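  • The idea is like asking someone to watch the state of your fds for you; when needed, you simply collect the fds that are ready
  • epoll stands for event poll and is specific to Linux
  • It lets a process monitor multiple file descriptors and be notified when I/O is possible on them (edge-triggered or level-triggered)
  • epoll itself is not a system call but a kernel data structure, operated on through the epoll_create, epoll_ctl, and epoll_wait system calls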

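epoll API

1. epoll_create

Creates an epoll instance. This system call returns a file descriptor that refers to the new epoll instance.

  • Parameters
    • size: a hint for how many file descriptors the process expects to monitor. Since Linux 2.6.8 the value is ignored and the kernel sizes its structures dynamically, but it must still be greater than zero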
#include <sys/epoll.h>
int epoll_create(int size);

There is also a variant, epoll_create1(int flags). flags may be 0 or EPOLL_CLOEXEC; with 0 it behaves the same as epoll_create.

int epoll_create1(int flags);

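With flags = EPOLL_CLOEXEC, the close-on-exec flag (FD_CLOEXEC) is set on the new descriptor. A forked child still inherits the descriptor, but the kernel closes it automatically when that process calls exec(), so the newly executed program cannot use the epoll instance.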
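
As a minimal sketch of the two flag values, the helpers below create an epoll instance with EPOLL_CLOEXEC and read the descriptor flags back with fcntl. The names create_cloexec_epoll and has_cloexec are hypothetical, chosen only for this illustration.

```c
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: create an epoll instance whose descriptor is
 * closed automatically on exec(). Returns the epoll fd, or -1 on error. */
int create_cloexec_epoll(void)
{
    return epoll_create1(EPOLL_CLOEXEC);
}

/* Hypothetical helper: returns 1 if close-on-exec is set on fd,
 * 0 if it is not, and -1 if fcntl fails. */
int has_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags < 0)
        return -1;
    return (flags & FD_CLOEXEC) ? 1 : 0;
}
```

Calling has_cloexec on a descriptor from epoll_create1(0) should report 0, which is the only difference between the two flag values.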

2. epoll_ctl

A process uses this function to register the file descriptors it wants to watch.

The set of registered fds is called the epoll set or the interest list.

The ready list is a subset of the interest list: the registered fds that currently have events pending.

Usage

#include <sys/epoll.h>
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
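  • Parameters:

    • epfd: the file descriptor returned by epoll_create(), identifying the epoll instance
    • op: the operation, one of three
      • register: EPOLL_CTL_ADD
      • remove: EPOLL_CTL_DEL
      • modify: EPOLL_CTL_MOD
    • fd: the file descriptor to add to, remove from, or modify in the interest list
    • event: a pointer to an epoll_event structure
  • epoll_event contains events and data

    • events: a bitmask stating which events to watch for, such as EPOLLIN or EPOLLOUT (see the list of events in epoll_ctl(2))
    • data: of type epoll_data_t; user data that the kernel stores and hands back unchanged when the fd becomes ready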
typedef union epoll_data {
    void        *ptr;
    int          fd;
    uint32_t     u32;
    uint64_t     u64;
} epoll_data_t;

struct epoll_event {
    uint32_t     events;      /* Epoll events */
    epoll_data_t data;        /* User data variable */
};

3. epoll_wait

With the epoll_wait system call, a thread learns which events have fired on the fds in the epoll set / interest list.

#include <sys/epoll.h>
int epoll_wait(int epfd, struct epoll_event *evlist, int maxevents, int timeout);
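
| name | description |
| --- | --- |
| epfd | the file descriptor returned by epoll_create(), identifying the epoll instance |
| evlist | an array of epoll_event; epoll_wait fills it in, and it tells you which fds are on the ready list |
| maxevents | length of evlist (must be greater than zero) |
| timeout | maximum time to block, in milliseconds |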

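timeout

  • 0: return immediately after checking, without blocking
  • -1: block (sleep) indefinitely until an event arrives or a signal interrupts the call
  • >0: return when an event arrives or when the timeout expires, whichever comes first

return value

  • -1: an error occurred; errno is set to indicate the cause
  • 0: no fd is on the ready list (the timeout expired)
  • >0: at least one fd is on the ready list; the value is the number of file descriptors ready for the requested I/O, and you can then inspect evlist to see which fds have events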
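
The return values can be observed with a pipe, as in this minimal sketch. The function name ready_count_after is hypothetical; it uses timeout = 0 so the call never blocks.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: watch the read end of a fresh pipe, optionally
 * write one byte into it, then call epoll_wait once with timeout 0.
 * Returns epoll_wait's result: >0 ready count, 0 nothing ready,
 * -1 on any error. */
int ready_count_after(int write_first)
{
    int p[2];
    if (pipe(p) < 0)
        return -1;
    int epfd = epoll_create1(0);
    if (epfd < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    int n = -1;
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = p[0] };
    struct epoll_event evlist[4];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0 &&
        (!write_first || write(p[1], "x", 1) == 1)) {
        n = epoll_wait(epfd, evlist, 4, 0);
        /* Each returned entry carries the data registered with the fd. */
        if (n > 0 && evlist[0].data.fd != p[0])
            n = -1;
    }

    close(p[0]);
    close(p[1]);
    close(epfd);
    return n;
}
```

With an empty pipe the call reports 0; after one byte is written it reports 1 and evlist[0] identifies the read end.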

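epoll pitfalls

File descriptors

epoll is closely tied to file descriptors, so we need to understand fds first.

A process is associated with its I/O streams through file descriptors. Each process has its own fd table, and each entry has two fields: flags and a pointer. The only flag is close-on-exec, which is covered below.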

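Descriptors are created by system calls such as open, pipe, and socket, or inherited through fork.

A file descriptor is closed when the process exits or calls close. There is one more case: a descriptor marked close-on-exec is still copied into the child by fork, but it is closed automatically when the process calls exec(), so only the program that created it gets to use it.

For example, if process B is forked from A, B still holds the descriptor after the fork; the kernel closes it in B at the moment B calls exec(), so the new program image cannot use it.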
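
A small sketch of the fork half of this behavior follows. The function name child_sees_cloexec_fd is hypothetical, and the child deliberately never calls exec(), so it only shows that a close-on-exec descriptor survives fork.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if a forked child can still see a
 * close-on-exec descriptor before it calls exec(), 0 if it cannot,
 * and -1 on error. */
int child_sees_cloexec_fd(void)
{
    int p[2];
    if (pipe(p) < 0)
        return -1;

    int result = -1;
    if (fcntl(p[0], F_SETFD, FD_CLOEXEC) == 0) {
        pid_t pid = fork();
        if (pid == 0) {
            /* Child: no exec() has happened, so the fd is still open. */
            _exit(fcntl(p[0], F_GETFD) >= 0 ? 0 : 1);
        }
        int status;
        if (pid > 0 && waitpid(pid, &status, 0) == pid && WIFEXITED(status))
            result = (WEXITSTATUS(status) == 0) ? 1 : 0;
    }

    close(p[0]);
    close(p[1]);
    return result;
}
```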

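The kernel also maintains a system-wide open file table that records every open file. If the same file is opened by two processes, the table holds two entries, one per open() call.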

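When a process is forked, its descriptors are copied and the copies point to the same open file table entries. If one process changes the offset, every copy sees the change, because the offset lives in the shared entry.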
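
The shared offset can be checked with a temporary file, as in this sketch. The function name parent_offset_after_child_read and the /tmp path template are assumptions made for the example.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: a forked child reads nbytes (at most 10) from a
 * descriptor inherited from the parent. Returns the offset the parent
 * observes afterwards, or -1 on error. */
off_t parent_offset_after_child_read(size_t nbytes)
{
    char path[] = "/tmp/fd_share_XXXXXX";
    char buf[16];
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path); /* the file lives on until fd is closed */

    off_t result = -1;
    if (nbytes <= sizeof buf &&
        write(fd, "0123456789", 10) == 10 &&
        lseek(fd, 0, SEEK_SET) == 0) {
        pid_t pid = fork();
        if (pid == 0) {
            /* Child: reading advances the offset in the shared entry. */
            _exit(read(fd, buf, nbytes) == (ssize_t)nbytes ? 0 : 1);
        }
        int status;
        if (pid > 0 && waitpid(pid, &status, 0) == pid &&
            WIFEXITED(status) && WEXITSTATUS(status) == 0)
            result = lseek(fd, 0, SEEK_CUR); /* query without moving */
    }

    close(fd);
    return result;
}
```

The parent never reads from the file itself, yet its offset moves by exactly the number of bytes the child consumed.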

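The inode table

An inode is a file system data structure that records information about a file system object.

The information includes:

  • location: which blocks or which disk the data is stored on
  • the attributes of the file or directory
  • additional metadata, such as access time, owner, and permissions

Every file in a file system has an inode entry, identified by its inode number.
The inode table records the mapping from inode number to inode structure.

For example, Process A opens abc.txt and gets fd5, and Process B opens the same file and gets fd10. The two descriptors point to different entries in the open file table, but both entries ultimately point to the same file (the same inode).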
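
The same relationship can be sketched inside one process by opening a file twice, which also yields two open file table entries. The function name same_inode_separate_offsets and the /tmp path template are assumptions for this illustration; the text above uses two processes.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: opens one file twice. Returns 1 if both
 * descriptors reach the same inode while keeping independent offsets,
 * 0 if not, and -1 on error. */
int same_inode_separate_offsets(void)
{
    char path[] = "/tmp/inode_demo_XXXXXX";
    int fd1 = mkstemp(path);        /* first open file table entry */
    if (fd1 < 0)
        return -1;
    int fd2 = open(path, O_RDONLY); /* second, independent entry */
    unlink(path);
    if (fd2 < 0) {
        close(fd1);
        return -1;
    }

    int result = -1;
    struct stat s1, s2;
    if (write(fd1, "hello", 5) == 5 &&
        fstat(fd1, &s1) == 0 && fstat(fd2, &s2) == 0) {
        int same_inode = s1.st_dev == s2.st_dev && s1.st_ino == s2.st_ino;
        /* Writing moved fd1's offset to 5; fd2's offset is untouched. */
        int separate = lseek(fd1, 0, SEEK_CUR) == 5 &&
                       lseek(fd2, 0, SEEK_CUR) == 0;
        result = same_inode && separate;
    }

    close(fd1);
    close(fd2);
    return result;
}
```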

epoll internals

Suppose process A has opened two different files.

Process A calls epoll_create to create an epoll instance, which gets fd9 as its file descriptor.

It then uses epoll_ctl to add an fd to watch: fd0 is added to the interest list.

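If Process B is forked at this point, B inherits A's fd table, and even fd9 is shared.
Whatever Process A adds to the interest list afterwards, Process B is notified about it as well.

Even if a process closes its descriptor for a file that epoll is watching, it still receives notifications. The registration is tied to the underlying open file description, which stays alive as long as any other descriptor (in this process or another) still refers to it.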
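
This pitfall can be reproduced in a single process by using dup() in place of fork, since both leave a second descriptor on the same open file description. The sketch below makes that substitution, and the function name events_after_close is hypothetical.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: registers the read end of a pipe, duplicates
 * it, closes the registered descriptor, then makes the pipe readable.
 * Returns epoll_wait's result, or -1 on error. */
int events_after_close(void)
{
    int p[2];
    if (pipe(p) < 0)
        return -1;
    int epfd = epoll_create1(0);
    if (epfd < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    int n = -1;
    int copy = -1;
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = p[0] };
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0 &&
        (copy = dup(p[0])) >= 0) {
        /* The open file description lives on through copy, so the
         * registration is NOT removed by this close. */
        close(p[0]);
        p[0] = -1;
        if (write(p[1], "x", 1) == 1) {
            struct epoll_event out;
            n = epoll_wait(epfd, &out, 1, 0);
        }
    }

    if (p[0] >= 0)
        close(p[0]);
    if (copy >= 0)
        close(copy);
    close(p[1]);
    close(epfd);
    return n;
}
```

The event is still delivered, and its data.fd holds the number of the descriptor that was already closed. Calling epoll_ctl with EPOLL_CTL_DEL before close avoids this.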

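Comparison with select / poll

Complexity of select / poll

O(N): every check scans the whole set. For a web server, that means examining every client on each call.

With epoll you only call epoll_wait, and everything it returns is an fd on which an event has occurred.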

As a result, the cost of epoll is O(number of events that have occurred) and not O(number of descriptors being monitored) as was the case with select/poll.
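
The difference shows up in how many entries the caller has to examine. The sketch below makes one of eight pipes readable and counts the entries looked at in user space with each API; it does not measure kernel-side cost. The function name compare_scan_cost and the constant NPIPES are hypothetical.

```c
#include <poll.h>
#include <sys/epoll.h>
#include <unistd.h>

enum { NPIPES = 8 };

/* Hypothetical helper: makes one of NPIPES pipes readable, then stores
 * how many entries the caller examines with poll and with epoll.
 * Returns 0 on success, -1 on error. */
int compare_scan_cost(int *poll_examined, int *epoll_examined)
{
    int pipes[NPIPES][2];
    struct pollfd pfds[NPIPES];
    struct epoll_event out[NPIPES];
    int opened = 0, rc = -1;
    int epfd = epoll_create1(0);
    if (epfd < 0)
        return -1;

    for (; opened < NPIPES; opened++) {
        if (pipe(pipes[opened]) < 0)
            break;
        int rfd = pipes[opened][0];
        pfds[opened] = (struct pollfd){ .fd = rfd, .events = POLLIN };
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = rfd };
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, rfd, &ev) < 0) {
            close(pipes[opened][0]);
            close(pipes[opened][1]);
            break;
        }
    }

    if (opened == NPIPES && write(pipes[3][1], "x", 1) == 1) {
        /* poll: revents is filled in, but the caller walks all N. */
        int ready_poll = 0;
        *poll_examined = 0;
        if (poll(pfds, NPIPES, 0) >= 0) {
            for (int i = 0; i < NPIPES; i++) {
                (*poll_examined)++;
                if (pfds[i].revents & POLLIN)
                    ready_poll++;
            }
        }
        /* epoll: only the ready entries are returned. */
        int n = epoll_wait(epfd, out, NPIPES, 0);
        *epoll_examined = n;
        if (ready_poll == 1 && n == 1)
            rc = 0;
    }

    for (int i = 0; i < opened; i++) {
        close(pipes[i][0]);
        close(pipes[i][1]);
    }
    close(epfd);
    return rc;
}
```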

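Level-triggered vs. edge-triggered

  • Level-triggered: an I/O event is generated whenever the condition holds
  • Edge-triggered: an I/O event is generated when the state changes

By default epoll provides level-triggered notifications: each call to epoll_wait returns only the ready list, for example [fd2, fd3].

Sometimes, though, we only care about changes in an fd's state rather than whether it is currently ready; that is, we want edge-triggered notifications. To get them, OR the EPOLLET flag into the events bitmask.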

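-- Register fd with the poller; r / w select read and write readiness.
function Poller:register(fd, r, w)
	local ev = self.ev[0]
	-- EPOLLET requests edge-triggered mode. EPOLLERR and EPOLLHUP are always
	-- reported by epoll_wait, so setting them here is optional.
	ev.events = bit.bor(C.EPOLLET, C.EPOLLERR, C.EPOLLHUP)
	if r then
		ev.events = bit.bor(ev.events, C.EPOLLIN)
	end
	if w then
		ev.events = bit.bor(ev.events, C.EPOLLOUT)
	end
	-- The kernel hands this value back unchanged from epoll_wait.
	ev.data.u64 = fd
	local rc = C.epoll_ctl(self.fd, C.EPOLL_CTL_ADD, fd, ev)
	if rc < 0 then errors.get(rc):abort() end
end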
  • edge trigger vs level trigger:
    • level trigger: focuses on the condition (ready); as long as the fd is ready, you keep being notified.
    • edge trigger: you are notified once, only when the state changes
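
The two modes can be compared by leaving data unread and calling epoll_wait twice, as in this sketch. The function name count_notifications is hypothetical.

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: registers a pipe's read end with
 * EPOLLIN | extra_flags, writes one byte that is never read, then
 * calls epoll_wait twice without blocking. Returns how many of the
 * two calls reported the fd, or -1 on error. */
int count_notifications(uint32_t extra_flags)
{
    int p[2];
    if (pipe(p) < 0)
        return -1;
    int epfd = epoll_create1(0);
    if (epfd < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    int count = -1;
    struct epoll_event ev = { .events = EPOLLIN | extra_flags,
                              .data.fd = p[0] };
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0 &&
        write(p[1], "x", 1) == 1) {
        struct epoll_event out;
        count = 0;
        for (int i = 0; i < 2; i++) {
            int n = epoll_wait(epfd, &out, 1, 0);
            if (n < 0) {
                count = -1;
                break;
            }
            count += n; /* n is 0 or 1 because maxevents is 1 */
        }
    }

    close(p[0]);
    close(p[1]);
    close(epfd);
    return count;
}
```

With extra_flags = 0 (level-triggered) both calls report the fd because the byte is still unread. With EPOLLET only the first call does, which is why edge-triggered code usually reads until EAGAIN on a non-blocking descriptor.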

References
The method to epoll’s madness
边缘触发(Edge Trigger)和条件触发(Level Trigger)
深入理解 Linux 的 epoll 機制