# 2021q1 quiz11
contributed by < [Julian-Chu](https://github.com/Julian-Chu) >
###### tags: `linux2021`

> [Week 11 quiz](https://hackmd.io/@sysprog/linux2021-quiz11)
> [GitHub](https://github.com/Julian-Chu/linux2021-quizzes/tree/main/quiz11-eventloop)

## A classic event loop: Node.js
![](https://csharpcorner.azureedge.net/article/node-js-event-loop/Images/1.png)

## Event Loop in quiz11
![](https://i.imgur.com/c1hyWXC.png)

### data structure

`ev` manages the event loop as a whole.

`ev_entry` represents a single event and carries its type and callback. Because it has to cover different kinds of events, `ev_entry` uses unions so that the type fields share one storage slot and the callback pointers share another.

`ev_entry->type` holds one of our own `EV_*` event types.

`ev_entry->type_raw` holds the events bit mask passed to `epoll_ctl`.

```cpp
enum {
    EV_READ = (1 << 0),
    EV_WRITE = (1 << 1),
    EV_TIMEOUT_ONESHOT = (1 << 2),
    EV_TIMEOUT_PERIODIC = (1 << 3),
    EV_SIGNAL = (1 << 4),

    /* flag for ev_new(), not an event type */
    EV_CLOEXEC = (1 << 0),
};

/*
 * Forward declaration - ev objects are fully opaque to callers. Access is
 * always done via access functions. Just keep in mind: ev is the main object,
 * you usually need one. For each registered event like a timer or file
 * descriptor a corresponding ev_entry object is required.
 */
struct ev;
struct ev_entry;

struct ev {
    int fd;
    int break_loop;
    unsigned long long entries;

    /* implementation specific data, e.g. select timer handling
     * will use this to store the rbtree */
    void *priv_data;
};

struct ev_entry {
    /* monitored FD if type is EV_READ or EV_WRITE */
    int fd;

    /* EV_* if raw is 0 -> type is used. E.g. for
     * EV_READ, EV_WRITE or EV_TIMEOUT_ONESHOT.
     * EV_RAW_* if raw is 1 -> type_raw is used then */
    union {
        int type;
        uint32_t type_raw;
    };

    /* 0 for "old" mode, if 1 type is interpreted identical
     * as epoll_ctl flags */
    int raw;

    /* timeout val if type is EV_TIMEOUT_ONESHOT */
    struct timespec timespec;

    union {
        void (*fd_cb)(int, int, void *);
        void (*fd_cb_raw)(int, uint32_t, void *);
        void (*timer_cb_oneshot)(void *);
        void (*timer_cb_periodic)(unsigned long long, void *);
        void (*signal_cb)(uint32_t, uint32_t, void *);
    };

    /* user provided pointer to data */
    void *data;

    /* implementation specific data (e.g. for epoll, select) */
    void *priv_data;
};
```

### Event Loop

The main logic of the event loop:

```cpp
int ev_loop(struct ev *ev, int flags)
{
    (void) flags;

    struct epoll_event events[EVE_EPOLL_ARRAY_SIZE];
    // keep running while events are still registered on ev
    while (ev->entries > 0) {
        // fetch the ready list of the epoll file descriptor
        // and copy the corresponding epoll_event items into events
        int nfds = epoll_wait(ev->fd, events, EVE_EPOLL_ARRAY_SIZE, -1);
        if (nfds < 0)
            return -EINVAL;

        /* multiplex and call the registered callback handler */
        for (int i = 0; i < nfds; i++) {
            // recover the ev_entry from epoll_event.data.ptr
            struct ev_entry *ev_entry = events[i].data.ptr;
            ev_process_call_internal(ev, ev_entry);
        }

        if (ev->break_loop)
            break;
    }

    return 0;
}
```

```cpp
struct epoll_event {
    uint32_t events;   /* Epoll events */
    epoll_data_t data; /* User data variable */
};

typedef union epoll_data {
    void *ptr;
    int fd;
    uint32_t u32;
    uint64_t u64;
} epoll_data_t;
```

:::spoiler `man epoll_wait`
>The epoll_wait() system call waits for events on the epoll(7) instance referred to by the file descriptor epfd. The buffer pointed to by events is used to return information from the ready list about file descriptors in the interest list that have some events available. Up to maxevents are returned by epoll_wait(). The maxevents argument must be greater than zero.
>
>The timeout argument specifies the number of milliseconds that epoll_wait() will block.
Time is measured against the CLOCK_MONOTONIC clock.
>
>A call to epoll_wait() will block until either:
>
>* a file descriptor delivers an event;
>* the call is interrupted by a signal handler; or
>* the timeout expires.
>
>Note that the timeout interval will be rounded up to the system clock granularity, and kernel scheduling delays mean that the blocking interval may overrun by a small amount. Specifying a timeout of -1 causes epoll_wait() to block indefinitely, while specifying a timeout equal to zero cause epoll_wait() to return immediately, even if no events are available.
>
> The struct epoll_event is defined as:
>
>     typedef union epoll_data {
>         void *ptr;
>         int fd;
>         uint32_t u32;
>         uint64_t u64;
>     } epoll_data_t;
>
>     struct epoll_event {
>         uint32_t events;   /* Epoll events */
>         epoll_data_t data; /* User data variable */
>     };
>
>The data field of each returned epoll_event structure contains the same data as was specified in the most recent call to epoll_ctl(2) (EPOLL_CTL_ADD, EPOLL_CTL_MOD) for the corresponding open file descriptor.
>
>The **events** field is a **bit mask** that indicates the events that have occurred for the corresponding open file description. See epoll_ctl(2) for a list of the bits that may appear in this mask.

:::spoiler
EPOLLIN
The associated file is available for read(2) operations.

EPOLLOUT
The associated file is available for write(2) operations.

EPOLLRDHUP (since Linux 2.6.17)
Stream socket peer closed connection, or shut down writing half of connection. (This flag is especially useful for writing simple code to detect peer shutdown when using edge-triggered monitoring.)

EPOLLPRI
There is an exceptional condition on the file descriptor. See the discussion of POLLPRI in poll(2).

EPOLLERR
Error condition happened on the associated file descriptor. This event is also reported for the write end of a pipe when the read end has been closed. epoll_wait(2) will always report for this event; it is not necessary to set it in events when calling epoll_ctl().

EPOLLHUP
Hang up happened on the associated file descriptor. epoll_wait(2) will always wait for this event; it is not necessary to set it in events when calling epoll_ctl(). Note that when reading from a channel such as a pipe or a stream socket, this event merely indicates that the peer closed its end of the channel. Subsequent reads from the channel will return 0 (end of file) only after all outstanding data in the channel has been consumed.

EPOLLET
Requests edge-triggered notification for the associated file descriptor. The default behavior for epoll is level-triggered. See epoll(7) for more detailed information about edge-triggered and level-triggered notification. This flag is an input flag for the event.events field when calling epoll_ctl(); it is never returned by epoll_wait(2).

EPOLLONESHOT (since Linux 2.6.2)
Requests one-shot notification for the associated file descriptor. This means that after an event notified for the file descriptor by epoll_wait(2), the file descriptor is disabled in the interest list and no other events will be reported by the epoll interface. The user must call epoll_ctl() with EPOLL_CTL_MOD to rearm the file descriptor with a new event mask. This flag is an input flag for the event.events field when calling epoll_ctl(); it is never returned by epoll_wait(2).

EPOLLWAKEUP (since Linux 3.5)
If EPOLLONESHOT and EPOLLET are clear and the process has the CAP_BLOCK_SUSPEND capability, ensure that the system does not enter "suspend" or "hibernate" while this event is pending or being processed. The event is considered as being "processed" from the time when it is returned by a call to epoll_wait(2) until the next call to epoll_wait(2) on the same epoll(7) file descriptor, the closure of that file descriptor, the removal of the event file descriptor with EPOLL_CTL_DEL, or the clearing of EPOLLWAKEUP for the event file descriptor with EPOLL_CTL_MOD. See also BUGS. This flag is an input flag for the event.events field when calling epoll_ctl(); it is never returned by epoll_wait(2).

EPOLLEXCLUSIVE (since Linux 4.5)
Sets an exclusive wakeup mode for the epoll file descriptor that is being attached to the target file descriptor, fd. When a wakeup event occurs and multiple epoll file descriptors are attached to the same target file using EPOLLEXCLUSIVE, one or more of the epoll file descriptors will receive an event with epoll_wait(2). The default in this scenario (when EPOLLEXCLUSIVE is not set) is for all epoll file descriptors to receive an event. EPOLLEXCLUSIVE is thus useful for avoiding thundering herd problems in certain scenarios.

If the same file descriptor is in multiple epoll instances, some with the EPOLLEXCLUSIVE flag, and others without, then events will be provided to all epoll instances that did not specify EPOLLEXCLUSIVE, and at least one of the epoll instances that did specify EPOLLEXCLUSIVE.

The following values may be specified in conjunction with EPOLLEXCLUSIVE: EPOLLIN, EPOLLOUT, EPOLLWAKEUP, and EPOLLET. EPOLLHUP and EPOLLERR can also be specified, but this is not required: as usual, these events are always reported if they occur, regardless of whether they are specified in events. Attempts to specify other values in events yield the error EINVAL.

EPOLLEXCLUSIVE may be used only in an EPOLL_CTL_ADD operation; attempts to employ it with EPOLL_CTL_MOD yield an error. If EPOLLEXCLUSIVE has been set using epoll_ctl(), then a subsequent EPOLL_CTL_MOD on the same epfd, fd pair yields an error.
A call to epoll_ctl() that specifies EPOLLEXCLUSIVE in events and specifies the target file descriptor fd as an epoll instance will likewise fail. The error in all of these cases is EINVAL.

The EPOLLEXCLUSIVE flag is an input flag for the event.events field when calling epoll_ctl(); it is never returned by epoll_wait(2).
:::

:::spoiler `man epoll_ctl`
>This system call is used to add, modify, or remove entries in the interest list of the epoll(7) instance referred to by the file descriptor epfd. It requests that the operation op be performed for the target file descriptor, fd.
>
>Valid values for the op argument are:
>
>EPOLL_CTL_ADD
>Add an entry to the interest list of the epoll file descriptor, epfd. The entry includes the file descriptor, fd, a reference to the corresponding open file description (see epoll(7) and open(2)), and the settings specified in event.
:::

\
Inside the event loop, each event is dispatched to a handler that matches its type:

```cpp
static inline void ev_process_call_internal(struct ev *ev,
                                            struct ev_entry *ev_entry)
{
    (void) ev;

    if (ev_entry->raw) {
        ev_entry->fd_cb_raw(ev_entry->fd, ev_entry->type_raw, ev_entry->data);
        return;
    }

    switch (ev_entry->type) {
    case EV_READ:
    case EV_WRITE:
        ev_entry->fd_cb(ev_entry->fd, ev_entry->type, ev_entry->data);
        return;
        break;
    case EV_TIMEOUT_ONESHOT:
        ev_process_timer_oneshot(ev, ev_entry);
        break;
    case EV_TIMEOUT_PERIODIC:
        ev_process_timer_periodic(ev_entry);
        break;
    case EV_SIGNAL:
        ev_process_signal(ev_entry);
        break;
    default:
        return;
        break;
    }
    return;
}
```

### Creating and destroying the event loop

Note that `ev_new` calls `epoll_create1` to create an epoll file descriptor, which then monitors the fds of all registered events.

```cpp
void ev_destroy(struct ev *ev)
{
    /* close epoll descriptor */
    close(ev->fd);

    /* clear potential secure data */
    memset(ev, 0, sizeof(struct ev));
    free(ev);
}

static inline int ev_new_flags_convert(int flags)
{
    if (flags == 0)
        return 0;
    if (flags == EV_CLOEXEC)
        return EPOLL_CLOEXEC;
    return -EINVAL;
}

struct ev *ev_new(int flags)
{
    int flags_epoll = ev_new_flags_convert(flags);
    if (flags_epoll < 0)
        return NULL;

    struct ev *ev = struct_ev_new_internal();
    if (!ev)
        return NULL;

    // create the epoll file descriptor
    ev->fd = epoll_create1(flags_epoll);
    if (ev->fd < 0) {
        free(ev);
        return NULL;
    }

    ev->entries = 0;
    ev->break_loop = 0;

    return ev;
}
```

:::spoiler `int epoll_create1(int flags);`
>epoll_create1()
>If flags is 0, then, other than the fact that the obsolete size argument is dropped, epoll_create1() is the same as epoll_create(). The following value can be included in flags to obtain different behavior:
>
>EPOLL_CLOEXEC
>Set the close-on-exec (FD_CLOEXEC) flag on the new file descriptor. See the description of the O_CLOEXEC flag in open(2) for reasons why this may be useful.
:::

:::spoiler `man open`
>O_CLOEXEC (since Linux 2.6.23)
>Enable the close-on-exec flag for the new file descriptor. Specifying this flag permits a program to avoid additional fcntl(2) F_SETFD operations to set the FD_CLOEXEC flag.
>
>Note that the use of this flag is essential in some multithreaded programs, because using a separate fcntl(2) F_SETFD operation to set the FD_CLOEXEC flag does not suffice to avoid race conditions where one thread opens a file descriptor and attempts to set its close-on-exec flag using fcntl(2) at the same time as another thread does a fork(2) plus execve(2). Depending on the order of execution, the race may lead to the file descriptor returned by open() being unintentionally leaked to the program executed by the child process created by fork(2). (This kind of race is in principle possible for any system call that creates a file descriptor whose close-on-exec flag should be set, and various other Linux system calls provide an equivalent of the O_CLOEXEC flag to deal with this problem.)
:::

### Registering events with the event loop

`epoll_ctl` adds fds to, or removes them from, the epoll file descriptor. Note that timer and signal events each need their own fd initialization before they can be added.

```cpp
int ev_add(struct ev *ev, struct ev_entry *ev_entry)
{
    int ret;
    struct epoll_event epoll_ev;
    struct ev_entry_data_epoll *ev_entry_data_epoll = ev_entry->priv_data;

    memset(&epoll_ev, 0, sizeof(struct epoll_event));

    if (ev_entry->raw) {
        /* type is interpreted as raw epoll_ctl event, not special
         * internal event, no special treatment required */
        goto out;
    }

    switch (ev_entry->type) {
    case EV_TIMEOUT_ONESHOT:
        ret = ev_arm_timerfd_oneshot(ev_entry);
        if (ret != 0)
            return -EINVAL;
        break;
    case EV_TIMEOUT_PERIODIC:
        ret = ev_arm_timerfd_periodic(ev_entry);
        if (ret != 0)
            return -EINVAL;
        break;
    case EV_SIGNAL:
        ret = ev_arm_signal(ev_entry);
        if (ret != 0)
            return -EINVAL;
        break;
    default:
        // no special treatment of other entries
        break;
    }

out:
    /* FIXME: the mapping must be a one to one mapping */
    epoll_ev.events = ev_entry_data_epoll->flags;
    epoll_ev.data.ptr = ev_entry;

    ret = epoll_ctl(ev->fd, EPOLL_CTL_ADD, ev_entry->fd, &epoll_ev);
    if (ret < 0)
        return -EINVAL;

    ev->entries++;

    return 0;
}

int ev_del(struct ev *ev, struct ev_entry *ev_entry)
{
    struct epoll_event epoll_ev;

    memset(&epoll_ev, 0, sizeof(struct epoll_event));
    int ret = epoll_ctl(ev->fd, EPOLL_CTL_DEL, ev_entry->fd, &epoll_ev);
    if (ret < 0)
        return -EINVAL;

    ev->entries--;

    return 0;
}
```

### Creating events

The three public constructors below create events for registration. A raw event needs an already initialized fd from the caller, whereas timer and signal events get their fd initialized inside `ev_add`.

```cpp
struct ev_entry *ev_entry_new_raw(int fd,
                                  uint32_t events,
                                  void (*cb)(int, uint32_t, void *),
                                  void *data)
{
    struct ev_entry *ev_entry = ev_entry_new_epoll_internal();
    if (!ev_entry)
        return NULL;

    ev_entry->fd = fd;
    ev_entry->type_raw = events;
    ev_entry->fd_cb_raw = cb;
    /* ev_entry->raw = CCC; */
    ev_entry->raw = 1;
    ev_entry->data = data;

    struct ev_entry_data_epoll *ev_entry_data_epoll = ev_entry->priv_data;
    ev_entry_data_epoll->flags = events;

    return ev_entry;
}

struct ev_entry *ev_timer_oneshot_new(struct timespec *timespec,
                                      void (*cb)(void *),
                                      void *data)
{
    struct ev_entry *ev_entry = ev_entry_new_epoll_internal();
    if (!ev_entry)
        return NULL;

    ev_entry->type = EV_TIMEOUT_ONESHOT;
    ev_entry->data = data;
    ev_entry->timer_cb_oneshot = cb;
    /* ev_entry->raw = DDD; */
    ev_entry->raw = 0;

    memcpy(&ev_entry->timespec, timespec, sizeof(struct timespec));

    return ev_entry;
}

struct ev_entry *ev_timer_periodic_new(struct timespec *timespec,
                                       void (*cb)(unsigned long long, void *),
                                       void *data)
{
    struct ev_entry *ev_entry = ev_entry_new_epoll_internal();
    if (!ev_entry)
        return NULL;

    ev_entry->type = EV_TIMEOUT_PERIODIC;
    ev_entry->data = data;
    ev_entry->timer_cb_periodic = cb;
    /* ev_entry->raw = EEE; */
    ev_entry->raw = 0;

    memcpy(&ev_entry->timespec, timespec, sizeof(struct timespec));

    return ev_entry;
}
```

Handling of timer and signal events once they fire (the fds themselves are armed inside `ev_add` by the `ev_arm_*` helpers):

```cpp
static inline void ev_process_timer_oneshot(struct ev *ev,
                                            struct ev_entry *ev_entry)
{
    unsigned long long missed;

    /* and now: cleanup timer specific data and
     * finally all event specific data */
    ssize_t ret = read(ev_entry->fd, &missed, sizeof(missed));
    if (ret < 0)
        assert(0);

    ev_del(ev, ev_entry);

    /* first of all - call user callback */
    ev_entry->timer_cb_oneshot(ev_entry->data);
}

static inline void ev_process_timer_periodic(struct ev_entry *ev_entry)
{
    unsigned long long missed;

    /* and now: cleanup timer specific data and
     * finally all event specific data */
    ssize_t ret = read(ev_entry->fd, &missed, sizeof(missed));
    if (ret < 0)
        assert(0);

    /* first of all - call user callback */
    ev_entry->timer_cb_periodic(missed, ev_entry->data);
}

static inline void ev_process_signal(struct ev_entry *ev_entry)
{
    struct signalfd_siginfo sigsiginfo;

    /* and now: cleanup timer specific data and
     * finally all event specific data */
    ssize_t ret = read(ev_entry->fd, &sigsiginfo, sizeof(sigsiginfo));
    if (ret < 0) {
        assert(0);
        return;
    }

    if (ret != sizeof(sigsiginfo))
        return;

    ev_entry->signal_cb(sigsiginfo.ssi_signo, sigsiginfo.ssi_pid,
                        ev_entry->data);
}
```

### Freeing an event's memory

```cpp
static void ev_entry_timer_free(struct ev_entry *ev_entry)
{
    close(ev_entry->fd);
}

static void ev_entry_signal_free(struct ev_entry *ev_entry)
{
    close(ev_entry->fd);
}

void ev_entry_free(struct ev_entry *ev_entry)
{
    if (ev_entry->raw)
        goto out;

    switch (ev_entry->type) {
    case EV_TIMEOUT_ONESHOT:
    case EV_TIMEOUT_PERIODIC:
        ev_entry_timer_free(ev_entry);
        break;
    case EV_SIGNAL:
        ev_entry_signal_free(ev_entry);
        break;
    default:
        // other events have no special cleaning
        // functions. do nothing
        break;
    }

out:
    free(ev_entry->priv_data);
    memset(ev_entry, 0, sizeof(struct ev_entry));
    free(ev_entry);
}
```

## Everything is a file

"Everything is a file (descriptor)" shows up directly in this event loop: besides regular files, epoll lets us treat signals and timers uniformly as fds.

:::success
Todo:
- Remove the use of signalfd and rewrite the code so that its performance can be analyzed along the lines of "Benchmarking libevent against libev"