# 2021q1 quiz11
contributed by < [Julian-Chu](https://github.com/Julian-Chu) >
###### tags: `linux2021`
> [Week 11 quiz](https://hackmd.io/@sysprog/linux2021-quiz11)
> [GitHub](https://github.com/Julian-Chu/linux2021-quizzes/tree/main/quiz11-eventloop)
## One of the classic event loops: Node.js
![](https://csharpcorner.azureedge.net/article/node-js-event-loop/Images/1.png)
## Event Loop in quiz11
![](https://i.imgur.com/c1hyWXC.png)
### data structure
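`ev` drives the event loop as a whole.
`ev_entry` represents a single event and carries its type and callback function. Because it has to cover several kinds of events, `ev_entry` uses unions internally so that the type fields, and likewise the callback pointers, share the same storage.
`ev_entry->type` holds one of our own `EV_*` event types.
`ev_entry->type_raw` holds the events bit mask that is passed to `epoll_ctl`.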
```cpp
enum {
    EV_READ = (1 << 0),
    EV_WRITE = (1 << 1),
    EV_TIMEOUT_ONESHOT = (1 << 2),
    EV_TIMEOUT_PERIODIC = (1 << 3),
    EV_SIGNAL = (1 << 4),
    /* flag for ev_new(), not an event type (same value as EV_READ) */
    EV_CLOEXEC = (1 << 0),
};

/*
 * Forward declaration - ev objects are fully opaque to callers. Access is
 * always done via access functions. Just keep in mind: ev is the main
 * object, and you usually need only one. For each registered event, such
 * as a timer or file descriptor, a corresponding ev_entry object is required.
 */
struct ev;
struct ev_entry;

struct ev {
    int fd;
    int break_loop;
    unsigned long long entries;
    /* implementation specific data, e.g. select timer handling
     * will use this to store the rbtree */
    void *priv_data;
};

struct ev_entry {
    /* monitored FD if type is EV_READ or EV_WRITE */
    int fd;

    /* EV_* if raw is 0 -> type is used. E.g. for
     * EV_READ, EV_WRITE or EV_TIMEOUT_ONESHOT.
     * EV_RAW_* if raw is 1 -> type_raw is used then
     */
    union {
        int type;
        uint32_t type_raw;
    };

    /* 0 for "old" mode, if 1 type is interpreted identical
     * as epoll_ctl flags */
    int raw;

    /* timeout val if type is EV_TIMEOUT_ONESHOT */
    struct timespec timespec;

    union {
        void (*fd_cb)(int, int, void *);
        void (*fd_cb_raw)(int, uint32_t, void *);
        void (*timer_cb_oneshot)(void *);
        void (*timer_cb_periodic)(unsigned long long, void *);
        void (*signal_cb)(uint32_t, uint32_t, void *);
    };

    /* user provided pointer to data */
    void *data;

    /* implementation specific data (e.g. for epoll, select) */
    void *priv_data;
};
```
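The two anonymous unions effectively turn `ev_entry` into a tagged union, with `raw` acting as the tag. The sketch below shows that pattern in isolation. `mini_entry`, `mini_dispatch`, `on_typed` and `on_raw` are hypothetical names for illustration only, not part of the quiz code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced version of ev_entry: `raw` is the tag that says
 * which member of each anonymous union (C11) is currently valid. */
struct mini_entry {
    union {
        int type;          /* valid when raw == 0 */
        uint32_t type_raw; /* valid when raw == 1 */
    };
    int raw;
    union {
        void (*cb)(int, void *);          /* valid when raw == 0 */
        void (*cb_raw)(uint32_t, void *); /* valid when raw == 1 */
    };
    void *data;
};

/* Example callbacks: record which path ran by writing into *data. */
static void on_typed(int type, void *data)
{
    *(int *) data = type;
}

static void on_raw(uint32_t mask, void *data)
{
    *(int *) data = -(int) mask;
}

/* Read only the union members that match the tag. */
static void mini_dispatch(struct mini_entry *e)
{
    assert(e != NULL);
    if (e->raw)
        e->cb_raw(e->type_raw, e->data);
    else
        e->cb(e->type, e->data);
}
```

Reading a union member other than the one last written is the main hazard of this layout, which is why every access goes through the `raw` check first.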
### Event Loop
The main logic of the event loop:
```cpp
int ev_loop(struct ev *ev, int flags)
{
    (void) flags;
    struct epoll_event events[EVE_EPOLL_ARRAY_SIZE];

    // keep running while events are still registered on ev
    while (ev->entries > 0) {
        // wait on the epoll file descriptor and fetch the
        // epoll_event entries from its ready list
        int nfds = epoll_wait(ev->fd, events, EVE_EPOLL_ARRAY_SIZE, -1);
        if (nfds < 0) return -EINVAL;

        /* multiplex and call the registered callback handler */
        for (int i = 0; i < nfds; i++) {
            // recover the ev_entry from epoll_event.data.ptr
            struct ev_entry *ev_entry = events[i].data.ptr;
            ev_process_call_internal(ev, ev_entry);
        }
        if (ev->break_loop) break;
    }
    return 0;
}
```
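`ev_loop` treats every negative return value from `epoll_wait` as fatal. One common refinement, which is not part of the quiz code, is to retry when the call was merely interrupted by a signal handler (`errno == EINTR`). A minimal sketch, where `epoll_wait_retry` is a hypothetical name:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/epoll.h>

/* Hypothetical wrapper: behaves like epoll_wait(), but restarts the call
 * when it is interrupted by a signal handler instead of reporting an error.
 * Note that each retry starts the full timeout again. */
static int epoll_wait_retry(int epfd,
                            struct epoll_event *events,
                            int maxevents,
                            int timeout_ms)
{
    assert(events != NULL && maxevents > 0);

    int n;
    do {
        n = epoll_wait(epfd, events, maxevents, timeout_ms);
    } while (n < 0 && errno == EINTR);
    return n;
}
```

Other errors, such as `EBADF` for an invalid epoll descriptor, are still passed through to the caller unchanged.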
```cpp
typedef union epoll_data {
    void *ptr;
    int fd;
    uint32_t u32;
    uint64_t u64;
} epoll_data_t;

struct epoll_event {
    uint32_t events;   /* Epoll events */
    epoll_data_t data; /* User data variable */
};
```
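The loop relies on `epoll_wait` handing back exactly the `data.ptr` that was stored at registration time. The following self-contained sketch checks that round trip with a pipe. `struct ctx` and `ptr_roundtrip` are hypothetical names, with `struct ctx` standing in for `struct ev_entry`.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical per-fd context, standing in for struct ev_entry. */
struct ctx {
    int fd;
    int hits;
};

/* Register the read end of a pipe with data.ptr = &c, make it readable,
 * and check that epoll_wait() returns the same pointer.
 * Returns 0 on success, -1 on any failure. */
static int ptr_roundtrip(void)
{
    int ret = -1;
    int pipefd[2];
    if (pipe(pipefd) < 0) return -1;

    int epfd = epoll_create1(0);
    if (epfd < 0) goto close_pipe;

    struct ctx c = {.fd = pipefd[0], .hits = 0};
    struct epoll_event reg = {.events = EPOLLIN, .data.ptr = &c};
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, c.fd, &reg) < 0) goto close_ep;

    /* one byte in the pipe makes the read end ready */
    if (write(pipefd[1], "x", 1) != 1) goto close_ep;

    struct epoll_event out;
    /* 1000 ms timeout so the sketch cannot hang */
    int n = epoll_wait(epfd, &out, 1, 1000);
    if (n == 1 && (out.events & EPOLLIN)) {
        struct ctx *got = out.data.ptr;
        got->hits++;
        if (got == &c && c.hits == 1) ret = 0;
    }

close_ep:
    close(epfd);
close_pipe:
    close(pipefd[0]);
    close(pipefd[1]);
    return ret;
}
```

Since `epoll_data_t` is a union, storing a pointer in `data.ptr` means `data.fd` can no longer be used for the same registration, which is why `ev_entry` keeps its own `fd` field.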
:::spoiler `man epoll_wait`
>The epoll_wait() system call waits for events on the epoll(7) instance referred to by the file descriptor epfd. The buffer pointed to by events is used to return information from the ready list about file descriptors in the interest list that have some events available. Up to maxevents are returned by epoll_wait(). The maxevents argument must be greater than zero.
>
>The timeout argument specifies the number of milliseconds that epoll_wait() will block. Time is measured against the CLOCK_MONOTONIC clock.
>
>A call to epoll_wait() will block until either:
>
>* a file descriptor delivers an event;
>* the call is interrupted by a signal handler; or
>* the timeout expires.
>
>Note that the timeout interval will be rounded up to the system clock granularity, and kernel scheduling delays mean that the blocking interval may overrun by a small amount. Specifying a timeout of -1 causes epoll_wait() to block indefinitely, while specifying a timeout equal to zero cause epoll_wait() to return immediately, even if no events are available.
>
> The struct epoll_event is defined as:
>
> typedef union epoll_data {
> void *ptr;
> int fd;
> uint32_t u32;
> uint64_t u64;
> } epoll_data_t;
>
> struct epoll_event {
> uint32_t events; /* Epoll events */
> epoll_data_t data; /* User data variable */
> };
>
>The data field of each returned epoll_event structure contains
the same data as was specified in the most recent call to
epoll_ctl(2) (EPOLL_CTL_ADD, EPOLL_CTL_MOD) for the corresponding
open file descriptor.
>
>The **events** field is a **bit mask** that indicates the events that
have occurred for the corresponding open file description. See
epoll_ctl(2) for a list of the bits that may appear in this mask.
:::spoiler
EPOLLIN
The associated file is available for read(2) operations.
EPOLLOUT
The associated file is available for write(2) operations.
EPOLLRDHUP (since Linux 2.6.17)
Stream socket peer closed connection, or shut down writing
half of connection. (This flag is especially useful for
writing simple code to detect peer shutdown when using
edge-triggered monitoring.)
EPOLLPRI
There is an exceptional condition on the file descriptor.
See the discussion of POLLPRI in poll(2).
EPOLLERR
Error condition happened on the associated file
descriptor. This event is also reported for the write end
of a pipe when the read end has been closed.
epoll_wait(2) will always report for this event; it is not
necessary to set it in events when calling epoll_ctl().
EPOLLHUP
Hang up happened on the associated file descriptor.
epoll_wait(2) will always wait for this event; it is not
necessary to set it in events when calling epoll_ctl().
Note that when reading from a channel such as a pipe or a
stream socket, this event merely indicates that the peer
closed its end of the channel. Subsequent reads from the
channel will return 0 (end of file) only after all
outstanding data in the channel has been consumed.
EPOLLET
Requests edge-triggered notification for the associated
file descriptor. The default behavior for epoll is level-
triggered. See epoll(7) for more detailed information
about edge-triggered and level-triggered notification.
This flag is an input flag for the event.events field when
calling epoll_ctl(); it is never returned by
epoll_wait(2).
EPOLLONESHOT (since Linux 2.6.2)
Requests one-shot notification for the associated file
descriptor. This means that after an event notified for
the file descriptor by epoll_wait(2), the file descriptor
is disabled in the interest list and no other events will
be reported by the epoll interface. The user must call
epoll_ctl() with EPOLL_CTL_MOD to rearm the file
descriptor with a new event mask.
This flag is an input flag for the event.events field when
calling epoll_ctl(); it is never returned by
epoll_wait(2).
EPOLLWAKEUP (since Linux 3.5)
If EPOLLONESHOT and EPOLLET are clear and the process has
the CAP_BLOCK_SUSPEND capability, ensure that the system
does not enter "suspend" or "hibernate" while this event
is pending or being processed. The event is considered as
being "processed" from the time when it is returned by a
call to epoll_wait(2) until the next call to epoll_wait(2)
on the same epoll(7) file descriptor, the closure of that
file descriptor, the removal of the event file descriptor
with EPOLL_CTL_DEL, or the clearing of EPOLLWAKEUP for the
event file descriptor with EPOLL_CTL_MOD. See also BUGS.
This flag is an input flag for the event.events field when
calling epoll_ctl(); it is never returned by
epoll_wait(2).
EPOLLEXCLUSIVE (since Linux 4.5)
Sets an exclusive wakeup mode for the epoll file
descriptor that is being attached to the target file
descriptor, fd. When a wakeup event occurs and multiple
epoll file descriptors are attached to the same target
file using EPOLLEXCLUSIVE, one or more of the epoll file
descriptors will receive an event with epoll_wait(2). The
default in this scenario (when EPOLLEXCLUSIVE is not set)
is for all epoll file descriptors to receive an event.
EPOLLEXCLUSIVE is thus useful for avoiding thundering herd
problems in certain scenarios.
If the same file descriptor is in multiple epoll
instances, some with the EPOLLEXCLUSIVE flag, and others
without, then events will be provided to all epoll
instances that did not specify EPOLLEXCLUSIVE, and at
least one of the epoll instances that did specify
EPOLLEXCLUSIVE.
The following values may be specified in conjunction with
EPOLLEXCLUSIVE: EPOLLIN, EPOLLOUT, EPOLLWAKEUP, and
EPOLLET. EPOLLHUP and EPOLLERR can also be specified, but
this is not required: as usual, these events are always
reported if they occur, regardless of whether they are
specified in events. Attempts to specify other values in
events yield the error EINVAL.
EPOLLEXCLUSIVE may be used only in an EPOLL_CTL_ADD
operation; attempts to employ it with EPOLL_CTL_MOD yield
an error. If EPOLLEXCLUSIVE has been set using
epoll_ctl(), then a subsequent EPOLL_CTL_MOD on the same
epfd, fd pair yields an error. A call to epoll_ctl() that
specifies EPOLLEXCLUSIVE in events and specifies the
target file descriptor fd as an epoll instance will
likewise fail. The error in all of these cases is EINVAL.
The EPOLLEXCLUSIVE flag is an input flag for the
event.events field when calling epoll_ctl(); it is never
returned by epoll_wait(2).
:::
:::spoiler `man epoll_ctl`
>This system call is used to add, modify, or remove entries in the
interest list of the epoll(7) instance referred to by the file
descriptor epfd. It requests that the operation op be performed
for the target file descriptor, fd.
>
>Valid values for the op argument are:
>
>EPOLL_CTL_ADD
Add an entry to the interest list of the epoll file
descriptor, epfd. The entry includes the file descriptor,
fd, a reference to the corresponding open file description
(see epoll(7) and open(2)), and the settings specified in
event.
:::
\
Inside the event loop, each event is handled differently depending on its type:
```cpp
static inline void ev_process_call_internal(struct ev *ev,
                                            struct ev_entry *ev_entry)
{
    (void) ev;
    if (ev_entry->raw) {
        ev_entry->fd_cb_raw(ev_entry->fd, ev_entry->type_raw, ev_entry->data);
        return;
    }

    switch (ev_entry->type) {
    case EV_READ:
    case EV_WRITE:
        ev_entry->fd_cb(ev_entry->fd, ev_entry->type, ev_entry->data);
        return;
        break;
    case EV_TIMEOUT_ONESHOT:
        ev_process_timer_oneshot(ev, ev_entry);
        break;
    case EV_TIMEOUT_PERIODIC:
        ev_process_timer_periodic(ev_entry);
        break;
    case EV_SIGNAL:
        ev_process_signal(ev_entry);
        break;
    default:
        return;
        break;
    }
    return;
}
```
### Creating and destroying the event loop
Note that `ev_new` calls `epoll_create1` to create an epoll file descriptor, which is then used to monitor the fds of all registered events.
```cpp
void ev_destroy(struct ev *ev)
{
    /* close epoll descriptor */
    close(ev->fd);
    /* clear potential secure data */
    memset(ev, 0, sizeof(struct ev));
    free(ev);
}

static inline int ev_new_flags_convert(int flags)
{
    if (flags == 0) return 0;
    if (flags == EV_CLOEXEC) return EPOLL_CLOEXEC;
    return -EINVAL;
}

struct ev *ev_new(int flags)
{
    int flags_epoll = ev_new_flags_convert(flags);
    if (flags_epoll < 0) return NULL;

    struct ev *ev = struct_ev_new_internal();
    if (!ev) return NULL;

    // create the epoll file descriptor
    ev->fd = epoll_create1(flags_epoll);
    if (ev->fd < 0) {
        free(ev);
        return NULL;
    }
    ev->entries = 0;
    ev->break_loop = 0;
    return ev;
}
```
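The effect of mapping `EV_CLOEXEC` to `EPOLL_CLOEXEC` can be observed with `fcntl(fd, F_GETFD)`. A small sketch, where `epoll_fd_is_cloexec` and `cloexec_demo` are hypothetical helpers written for this note:

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Returns 1 if an epoll fd created with `flags` has FD_CLOEXEC set,
 * 0 if it does not, and -1 on error. */
static int epoll_fd_is_cloexec(int flags)
{
    int epfd = epoll_create1(flags);
    if (epfd < 0) return -1;

    int fdflags = fcntl(epfd, F_GETFD);
    close(epfd);
    if (fdflags < 0) return -1;
    return (fdflags & FD_CLOEXEC) ? 1 : 0;
}

/* Usage: the flag is absent by default and present with EPOLL_CLOEXEC. */
static void cloexec_demo(void)
{
    assert(epoll_fd_is_cloexec(0) == 0);
    assert(epoll_fd_is_cloexec(EPOLL_CLOEXEC) == 1);
}
```

Setting the flag atomically at creation time is what avoids the `fork` plus `execve` race described in the `open(2)` excerpt.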
:::spoiler `int epoll_create1(int flags);`
>epoll_create1()
If flags is 0, then, other than the fact that the obsolete size
argument is dropped, epoll_create1() is the same as
epoll_create(). The following value can be included in flags to
obtain different behavior:
>
>EPOLL_CLOEXEC
Set the close-on-exec (FD_CLOEXEC) flag on the new file
descriptor. See the description of the O_CLOEXEC flag in
open(2) for reasons why this may be useful.
:::
:::spoiler `man open`
>O_CLOEXEC (since Linux 2.6.23)
Enable the close-on-exec flag for the new file descriptor.
Specifying this flag permits a program to avoid additional
fcntl(2) F_SETFD operations to set the FD_CLOEXEC flag.
>
>Note that the use of this flag is essential in some
multithreaded programs, because using a separate fcntl(2)
F_SETFD operation to set the FD_CLOEXEC flag does not
suffice to avoid race conditions where one thread opens a
file descriptor and attempts to set its close-on-exec flag
using fcntl(2) at the same time as another thread does a
fork(2) plus execve(2). Depending on the order of
execution, the race may lead to the file descriptor
returned by open() being unintentionally leaked to the
program executed by the child process created by fork(2).
(This kind of race is in principle possible for any system
call that creates a file descriptor whose close-on-exec
flag should be set, and various other Linux system calls
provide an equivalent of the O_CLOEXEC flag to deal with
this problem.)
:::
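### Registering an event with the event loop
`epoll_ctl` adds fds to, or removes them from, the interest list of the epoll file descriptor. Note that timer and signal entries each need their own kind of fd initialization before they can be added.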
```cpp
int ev_add(struct ev *ev, struct ev_entry *ev_entry)
{
    int ret;
    struct epoll_event epoll_ev;
    struct ev_entry_data_epoll *ev_entry_data_epoll = ev_entry->priv_data;

    memset(&epoll_ev, 0, sizeof(struct epoll_event));

    if (ev_entry->raw) {
        /* type is interpreted as raw epoll_ctl event, not special
         * internal event, no special treatment required */
        goto out;
    }

    switch (ev_entry->type) {
    case EV_TIMEOUT_ONESHOT:
        ret = ev_arm_timerfd_oneshot(ev_entry);
        if (ret != 0) return -EINVAL;
        break;
    case EV_TIMEOUT_PERIODIC:
        ret = ev_arm_timerfd_periodic(ev_entry);
        if (ret != 0) return -EINVAL;
        break;
    case EV_SIGNAL:
        ret = ev_arm_signal(ev_entry);
        if (ret != 0) return -EINVAL;
        break;
    default:
        // no special treatment of other entries
        break;
    }

out:
    /* FIXME: the mapping must be a one to one mapping */
    epoll_ev.events = ev_entry_data_epoll->flags;
    epoll_ev.data.ptr = ev_entry;

    ret = epoll_ctl(ev->fd, EPOLL_CTL_ADD, ev_entry->fd, &epoll_ev);
    if (ret < 0) return -EINVAL;

    ev->entries++;
    return 0;
}

int ev_del(struct ev *ev, struct ev_entry *ev_entry)
{
    struct epoll_event epoll_ev;
    memset(&epoll_ev, 0, sizeof(struct epoll_event));

    int ret = epoll_ctl(ev->fd, EPOLL_CTL_DEL, ev_entry->fd, &epoll_ev);
    if (ret < 0) return -EINVAL;

    ev->entries--;
    return 0;
}
```
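### Creating an event
Events to be registered are created through the three public constructors below. A raw entry requires the caller to pass in an fd that is already initialized, whereas timer and signal entries get their fd initialized inside `ev_add`.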
```cpp
struct ev_entry *ev_entry_new_raw(int fd,
                                  uint32_t events,
                                  void (*cb)(int, uint32_t, void *),
                                  void *data)
{
    struct ev_entry *ev_entry = ev_entry_new_epoll_internal();
    if (!ev_entry) return NULL;

    ev_entry->fd = fd;
    ev_entry->type_raw = events;
    ev_entry->fd_cb_raw = cb;
    /* ev_entry->raw = CCC; */
    ev_entry->raw = 1;
    ev_entry->data = data;

    struct ev_entry_data_epoll *ev_entry_data_epoll = ev_entry->priv_data;
    ev_entry_data_epoll->flags = events;
    return ev_entry;
}

struct ev_entry *ev_timer_oneshot_new(struct timespec *timespec,
                                      void (*cb)(void *),
                                      void *data)
{
    struct ev_entry *ev_entry = ev_entry_new_epoll_internal();
    if (!ev_entry) return NULL;

    ev_entry->type = EV_TIMEOUT_ONESHOT;
    ev_entry->data = data;
    ev_entry->timer_cb_oneshot = cb;
    /* ev_entry->raw = DDD; */
    ev_entry->raw = 0;
    memcpy(&ev_entry->timespec, timespec, sizeof(struct timespec));
    return ev_entry;
}

struct ev_entry *ev_timer_periodic_new(struct timespec *timespec,
                                       void (*cb)(unsigned long long, void *),
                                       void *data)
{
    struct ev_entry *ev_entry = ev_entry_new_epoll_internal();
    if (!ev_entry) return NULL;

    ev_entry->type = EV_TIMEOUT_PERIODIC;
    ev_entry->data = data;
    ev_entry->timer_cb_periodic = cb;
    /* ev_entry->raw = EEE; */
    ev_entry->raw = 0;
    memcpy(&ev_entry->timespec, timespec, sizeof(struct timespec));
    return ev_entry;
}
```
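The fds for signals and timers are initialized inside `ev_add`. Once such an event fires, `ev_process_call_internal` hands it to one of the following helpers: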
```cpp
static inline void ev_process_timer_oneshot(struct ev *ev,
                                            struct ev_entry *ev_entry)
{
    unsigned long long missed;

    /* drain the expiration counter of the timerfd */
    ssize_t ret = read(ev_entry->fd, &missed, sizeof(missed));
    if (ret < 0) assert(0);

    /* a one-shot timer is unregistered before the user callback runs */
    ev_del(ev, ev_entry);

    /* call user callback */
    ev_entry->timer_cb_oneshot(ev_entry->data);
}

static inline void ev_process_timer_periodic(struct ev_entry *ev_entry)
{
    unsigned long long missed;

    /* read the number of expirations since the last read */
    ssize_t ret = read(ev_entry->fd, &missed, sizeof(missed));
    if (ret < 0) assert(0);

    /* call user callback with the expiration count */
    ev_entry->timer_cb_periodic(missed, ev_entry->data);
}

static inline void ev_process_signal(struct ev_entry *ev_entry)
{
    struct signalfd_siginfo sigsiginfo;

    /* read one pending signal from the signalfd */
    ssize_t ret = read(ev_entry->fd, &sigsiginfo, sizeof(sigsiginfo));
    if (ret < 0) {
        assert(0);
        return;
    }
    if (ret != sizeof(sigsiginfo)) return;

    ev_entry->signal_cb(sigsiginfo.ssi_signo, sigsiginfo.ssi_pid,
                        ev_entry->data);
}
```
### Freeing an event's memory
```cpp
static void ev_entry_timer_free(struct ev_entry *ev_entry)
{
    close(ev_entry->fd);
}

static void ev_entry_signal_free(struct ev_entry *ev_entry)
{
    close(ev_entry->fd);
}

void ev_entry_free(struct ev_entry *ev_entry)
{
    if (ev_entry->raw) goto out;

    switch (ev_entry->type) {
    case EV_TIMEOUT_ONESHOT:
    case EV_TIMEOUT_PERIODIC:
        ev_entry_timer_free(ev_entry);
        break;
    case EV_SIGNAL:
        ev_entry_signal_free(ev_entry);
        break;
    default:
        // other events have no special cleaning
        // functions. do nothing
        break;
    }

out:
    free(ev_entry->priv_data);
    memset(ev_entry, 0, sizeof(struct ev_entry));
    free(ev_entry);
}
```
## Everything is a file
The idea that everything is a file (descriptor) shows up clearly in this event loop:
besides regular files, signals and timers are also exposed as fds, so epoll can handle all of them in a uniform way.
:::success
Todo:
- Remove the use of signalfd and rewrite the code so that its performance can be analyzed along the lines of "Benchmarking libevent against libev"