# 2022q1 Homework6 (ktcp)
contributed by < [cantfindagoodname](https://github.com/cantfindagoodname) >
### How does the LKM take `port=XXXX` as an argument
```shell
$ grep -r port *
...
kecho_mod.c:static ushort port = DEFAULT_PORT;
kecho_mod.c:module_param(port, ushort, S_IRUGO);
...
```
We can see that the module calls the macro `module_param()`, which takes three parameters.
Its signature is `module_param(name, type, perm)`.
> In [include/linux/stat.h](https://github.com/torvalds/linux/blob/master/include/linux/stat.h)
> ```c
> #define S_IRUGO (S_IRUSR|S_IRGRP|S_IROTH)
> ```
> In [include/uapi/linux/stat.h](https://github.com/torvalds/linux/blob/master/include/uapi/linux/stat.h)
> ```c
> #define S_IRUSR 00400
> #define S_IRGRP 00040
> #define S_IROTH 00004
> ```
> `S_IRUGO` permits read access for the owner, group, and others, i.e., every user account.
> Code to trace : [gist](https://gist.github.com/cantfindagoodname/5bb113827b940057b08b254fa937a618)
> Reference : [hackmd](https://hackmd.io/@sysprog/linux-kernel-module#Linux-%E6%A0%B8%E5%BF%83%E6%A8%A1%E7%B5%84%E6%8E%9B%E8%BC%89%E6%A9%9F%E5%88%B6)
```shell
$ modinfo arg.ko
...
parm: arg:int
```
The parameter information shown by `modinfo` is registered by `module_param(...)`, which expands to `module_param_call(name, _set, _get, arg, perm)` and
`__MODULE_INFO(parmtype, name##type, #name ":" _type)`
For the predefined types, helper functions `param_set_<type>(val, ...)` and `param_get_<type>(buf, ...)` are defined.
They are passed to `module_param_call(...)` as function pointers, and they are what actually assign the value to the argument.
```c
/**
* ...
* Standard types are:
* byte, hexint, short, ushort, int, uint, long, ulong
* charp: a character pointer
* bool: a bool, values 0/1, y/n, Y/N.
* invbool: the above, only sense-reversed (N = true).
*/
```
At this point, the needed information (`name`, and `ops` derived from `type`) is saved in a `struct kernel_param`; the value is later assigned to the global variable of the same name.
After the `insmod` command is issued, `load_module(...)` is called.
> `load_module(...)` is defined in [kernel/module.c](https://github.com/torvalds/linux/blob/master/kernel/module.c)

`load_module(...)` then calls `parse_args(...)`.
> `parse_args(...)` is defined in [kernel/params.c](https://github.com/torvalds/linux/blob/master/kernel/params.c)
>
> Documentation :
> /* Args looks like "foo=bar,bar2 baz=fuz wiz". */

This matches the `arg=val` format we use to pass our argument with `insmod`.
`parse_args(...)` in turn calls `parse_one(...)`.
In `parse_one(...)`, we find the expression
```c
if (parameq(param, params[i].name)) {
    /**
     * param     : the argument name found by the parser
     * params[i] : the parameter registered by module_param(...)
     */
    ...
    err = params[i].ops->set(val, &params[i]);
    ...
}
```
### How is kHTTPd similar to the socket interface introduced by [CS:APP](https://csapp.cs.cmu.edu/)
> Code used as referenced is from [lab-0](https://hackmd.io/@sysprog/linux2022-lab0#%E6%95%B4%E5%90%88-tiny-web-server) / [notes](https://hackmd.io/@cantfindagoodname/linux2022-lab0#%E5%9C%A8-qtest-%E6%8F%90%E4%BE%9B%E6%96%B0%E7%9A%84%E5%91%BD%E4%BB%A4-web)
> [Github](https://github.com/cantfindagoodname/tiny-web-server)
>
> Which originated from [tiny.tar](http://csapp.cs.cmu.edu/3e/tiny.tar) of the official CS:APP website
kHTTPd follows an almost identical procedure in its API, where its functions correspond to functions in [tiny.c](https://github.com/cantfindagoodname/tiny-web-server/blob/master/tiny.c):
```shell
tiny.c <-> kHTTPd (main.c)
open_listenfd() - open_listen_socket()
socket()        - socket_create()
setsockopt()    - setsockopt()
bind()          - kernel_bind()
listen()        - kernel_listen()
tiny.c <-> kHTTPd (http_server.c)
accept()        - kernel_accept()
process()       - http_server_worker() # handle the request
tiny.c <-> kHTTPd (main.c)
close()         - close_listen_socket()
```
The details of accepting connections are documented in `man 2 listen`:
> To accept connections, the following steps are performed:
>
>1. A socket is created with socket(2).
>
>2. The socket is bound to a local address using bind(2), so that other sockets may be connect(2)ed to it.
>
>3. A willingness to accept incoming connections and a queue limit for incoming connections are specified with listen().
>
>4. Connections are accepted with accept(2).
:::spoiler
```c
/* tiny.c */
int main(int argc, char **argv) {
    ...
    listenfd = open_listenfd(default_port);
    ...
    while (1) {
        connfd = accept(listenfd, (SA *) &clientaddr, &clientlen);
        process(connfd, &clientaddr);
        close(connfd);
    }
    return 0;
}

int open_listenfd(int port) {
    ...
    /* Create a socket descriptor */
    if ((listenfd = socket(AF_INET, SOCK_STREAM, 0)) < 0) {
        return -1;
    }
    /* Eliminates "Address already in use" error from bind. */
    if (setsockopt(listenfd, SOL_SOCKET, SO_REUSEADDR,
                   (const void *) &optval, sizeof(int)) < 0) {
        return -1;
    }
    /* 6 is TCP's protocol number;
     * enabling this is much faster: 4000 req/s -> 17000 req/s */
    if (setsockopt(listenfd, 6, TCP_CORK,
                   (const void *) &optval, sizeof(int)) < 0) {
        return -1;
    }
    ...
    if (bind(listenfd, (SA *) &serveraddr, sizeof(serveraddr)) < 0) {
        return -1;
    }
    /* Make it a listening socket ready to accept connection requests */
    if (listen(listenfd, LISTENQ) < 0) {
        return -1;
    }
    ...
}
```
:::
There are several differences between kHTTPd and tiny-web-server:
1. kHTTPd is a kernel module, while tiny-web-server is a user-space program.
2. How they handle multiple requests:
In the original [tiny-web-server](http://csapp.cs.cmu.edu/3e/tiny.tar), the process would `fork()` child processes that each listen on the same port.
In kHTTPd, workers are assigned jobs with `kthread_run(http_server_worker, ...)`: every time a client issues a new connection, a kernel thread is created.
### Flaws in kHTTPd
I wasn't able to reproduce the bug stated in the [requirements](https://hackmd.io/@sysprog/linux2022-ktcp#tip1), but `dmesg` revealed another bug.
```shell
BUG: unable to handle page fault for address: ffffffffc0e6f595
#PF: supervisor instruction fetch in kernel mode
#PF: error_code(0x0010) - not-present page
PGD 168215067 P4D 168215067 PUD 168217067 PMD 10ce07067 PTE 0
Oops: 0010 [#2] SMP NOPTI
CPU: 0 PID: 109667 Comm: khttpd Tainted: P D OE 5.13.0-40-generic #45~20.04.1-Ubuntu
Hardware name: ASUSTeK COMPUTER INC. ASUS D700TA_S700TA/D700TA, BIOS D700TA.305 02/24/2021
RIP: 0010:0xffffffffc0e6f595
Code: Unable to access opcode bytes at RIP 0xffffffffc0e6f56b.
RSP: 0000:ffffc162435bfd68 EFLAGS: 00010296
RAX: 0000000000000000 RBX: ffffffffc0e6f5b0 RCX: 0000000000000218
RDX: 0000000000000000 RSI: 00000000fffffe01 RDI: ffffffffae9a4eef
RBP: ffffc162435bfde0 R08: 0000000000000000 R09: 0000000000000000
R10: ffff9de4f1650990 R11: 0000000000000000 R12: ffff9de576483000
R13: ffff9de54cab1d40 R14: ffffc16240f53e18 R15: ffff9de54cab1d40
FS: 0000000000000000(0000) GS:ffff9de7ce600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffc0e6f56b CR3: 0000000168210001 CR4: 00000000007706f0
PKRU: 55555554
Call Trace:
<TASK>
? kthread+0x128/0x150
? set_kthread_struct+0x40/0x40
? ret_from_fork+0x1f/0x30
</TASK>
Modules linked in: xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 lib
crc32c bpfilter br_netfilter bridge stp llc rfcomm aufs overlay cmac algif_hash algif_skcipher af_alg bnep nls_iso8859_1 nvidia_uvm(PO) nvidia_drm(PO) nvidia_modeset(PO) intel_tcc_cooling nvidia(PO) x86_p
kg_temp_thermal intel_powerclamp coretemp snd_sof_pci_intel_cnl snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocation soundwire_cadence snd_sof_intel_hda snd_sof_pci snd_sof_xtensa_dsp snd
_sof snd_soc_hdac_hda snd_hda_ext_core snd_soc_acpi_intel_match snd_hda_codec_realtek snd_soc_acpi soundwire_bus snd_hda_codec_generic ledtrig_audio snd_soc_core snd_hda_codec_hdmi snd_compress ac97_bus s
nd_pcm_dmaengine mei_hdcp intel_rapl_msr snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi kvm_intel snd_hda_codec snd_hda_core snd_usb_audio kvm snd_usbmidi_lib snd_hwdep snd_pcm
crct10dif_pclmul ghash_clmulni_intel snd_seq_midi snd_seq_midi_event aesni_intel snd_rawmidi crypto_simd snd_seq cryptd btusb btrtl rapl btbcm uvcvideo intel_cstate snd_seq_device btintel
drm_kms_helper snd_timer videobuf2_vmalloc bluetooth videobuf2_memops videobuf2_v4l2 processor_thermal_device videobuf2_common cec snd processor_thermal_rfim rc_core videodev processor_thermal_mbox proces
sor_thermal_rapl fb_sys_fops intel_rapl_common mei_me syscopyarea joydev asus_nb_wmi sysfillrect mc ecdh_generic int3403_thermal input_leds wmi_bmof efi_pstore ecc ee1004 sysimgblt soundcore mei intel_soc
_dts_iosf sch_fq_codel int340x_thermal_zone ipmi_devintf ipmi_msghandler msr int3400_thermal mac_hid acpi_thermal_rel acpi_pad acpi_tad parport_pc ppdev lp drm parport ip_tables x_tables autofs4 wacom hid
_generic usbhid hid mfd_aaeon asus_wmi sparse_keymap crc32_pclmul nvme e1000e nvme_core i2c_i801 ahci xhci_pci i2c_smbus libahci xhci_pci_renesas wmi video pinctrl_sunrisepoint
[last unloaded: khttpd]
CR2: ffffffffc0e6f595
---[ end trace 32ce5b26edd57ca9 ]---
```
This is shown after the following steps are performed:
1. raise a request with browser
2. rmmod
3. exit the browser
The cause is very likely that the worker thread cannot stop properly: it is created by `http_server_daemon`, and the parent thread is stopped when `rmmod` is issued, so the worker later jumps to code of the already-unloaded module. Hence the page fault.
### epoll and htstress.c
The I/O multiplexing interfaces all perform a similar task, as shown below.
In the description of `man epoll`:
> The epoll API performs a similar task to poll(2)

In the description of `man poll`:
> poll() performs a similar task to select(2)

Notice that `epoll(7)` is in section 7 of the manual, while `poll(2)` and `select(2)` are in section 2.
That means `poll` and `select` are system calls conforming to the POSIX standard, whereas `epoll` is a Linux-specific API.
#### `select` ( Supplementing [lab-0](https://hackmd.io/@cantfindagoodname/linux2022-lab0#%E5%9C%A8-qtest-%E6%8F%90%E4%BE%9B%E6%96%B0%E7%9A%84%E5%91%BD%E4%BB%A4-web) )
Like every I/O multiplexing interface, `select` monitors multiple file descriptors at the same time.
File descriptors of interest are collected in the built-in data structure `fd_set`, which `select` then monitors.
`select` is a blocking system call that waits until a monitored file descriptor has a pending request.
> Cite : OpenBSD [sys/select.h](https://github.com/openbsd/src/blob/master/sys/sys/select.h)
> glibc (the implementation on my machine) has them scattered over multiple files
`select` works with four helper macros:
```c
FD_ZERO(fd_set *set);          /* Initialize the set to empty */
FD_SET(int fd, fd_set *set);   /* Set the bit for fd */
FD_CLR(int fd, fd_set *set);   /* Clear the bit for fd */
FD_ISSET(int fd, fd_set *set); /* Check whether the bit for fd is set */
```
In essence, `select` operates on a bitmask over file descriptors to check whether each fd has a pending request.
One point to note, though, is the data structure `fd_set`:
```c
#ifndef FD_SETSIZE
#define FD_SETSIZE 1024
#endif
typedef struct fd_set {
__fd_mask fds_bits[__howmany(FD_SETSIZE, __NFDBITS)];
} fd_set;
/* __howmany determines how many __fd_mask is needed for FD_SETSIZE */
```
Why is `fd_set` fixed-size?
POSIX allows an implementation to define an upper limit, and this is a design error in `select`:
the `fd_set` objects must be reinitialized on each call to `select`, and `select` checks all specified file descriptors in all three sets (`readfds`, `writefds`, `exceptfds`) up to `nfds - 1`.
This can be a severe issue for time/memory efficiency, and for safety (descriptors beyond the limit would need an additional error code).
> Cite : `man select`- **BUG**
> This design error is avoided by poll(2)
#### poll
`poll` is a system call that performs a similar task to `select`.
However, it does not have the problem of an upper limit on file descriptors.
When multiplexing file descriptors, `select` uses a bitmask in which each of the `1024` bits represents one file descriptor.
In `poll`, the data structure instead compactly represents only the file descriptors of interest, as an array.
Another advantage `poll` has over `select` is that its data structure does not have to be reinitialized before every system call.
With `select`, before every call we must rebuild the `fd_set` with `FD_ZERO` and `FD_SET`, because the kernel overwrites it to mark the active file descriptors checked via `FD_ISSET`.
`poll`, on the other hand, does not destroy the caller's input, because its data structure separates the event fields from the fd field.
```c
struct pollfd {
int fd; /* file descriptor */
short events; /* requested events */
short revents; /* returned events */
};
```
The active file descriptors are then identified through the `revents` field, unlike with `FD_ISSET`, where the kernel has overwritten the whole `fd_set` structure.
#### epoll
Some problems may arise when using `select` or `poll`.
Consider a model with a large number of connections.
Not all connections are active all the time; in fact, it is very likely that only a few of all the connections are active.
`select` and `poll` both pass all the connections (the `fd_set` or the `struct pollfd` array) to the kernel so that it can filter out the active ones (they are state-based).
`epoll` is a mechanism designed for this scenario.
The concept is to separate the file descriptors into an `interest` list and a `ready` list.
The `interest` list is the set of all connections registered with `epoll`.
The `ready` list is the subset of the `interest` list containing the connections with pending events.
`epoll` then manages only the active connections instead of filtering out the inactive ones on every call.
`epoll` operates on 3 functions :
```c
int epoll_create(int size); /* epoll_create1() as extension */
int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);
```
`epoll_create()` creates a container for an `epoll` instance, holding both the `interest` and `ready` lists, and returns the "file descriptor" of the `epoll` instance.
`epoll_ctl()` adds file descriptors to (or modifies or removes them from) the `interest` list, registering connections with the `epoll` structure.
`epoll_wait()` waits for I/O events, fetching them from the `ready` list into the caller's `struct epoll_event` array, and blocks while the `ready` list is empty.
`epoll` events also have two notification behaviors, `level-triggered` and `edge-triggered`.
`man epoll` describes the differences between them.
##### level / edge triggered
`level-triggered` notification keeps reporting a file descriptor as ready for as long as unconsumed data remains in the buffer.
`edge-triggered` notification reports only when a new event arrives.
The difference matters when the data cannot be consumed in one go:
with `level-triggered` behavior, the program will still complete consumption of the data over several notifications,
whereas with `edge-triggered` behavior the program hangs waiting for a notification that never re-fires (or gets `errno = EAGAIN` if the descriptor is set non-blocking).
Another notable point from `man epoll`:
when multiple threads are blocked in `epoll_wait()` on the same `edge-triggered` `epoll` instance and a file descriptor becomes ready, exactly one of the threads is woken up.
#### performance comparison
#### htstress
```shell
main :
1. setup address family (AF)
2. getopt (command line arguments)
3. get start time (get end time in worker)
4. run test
5. output
```
Note that when running the test,
```c
for (int n = 0; n < num_threads - 1; ++n)
pthread_create(&useless_thread, 0, &worker, 0);
worker(0);
```
The program creates `num_threads - 1` threads to invoke `worker`, then invokes it once more from the main thread.
I'm not sure why it is written like this; my best guess is to keep the main thread doing useful work instead of hogging a CPU while merely waiting for the other threads.
> TODO : check if `pthread_join()` hold resource while waiting
```shell
worker :
1. epoll_create()
2. init_conn
- fd = socket(), fcntl(fd)
- connect(fd) (non-blocking, loop if EAGAIN)
- epoll_ctl()
3. epoll_wait() (block until event)
4. handle event
- EPOLLOUT
1. send(fd)
2. schedule read
- EPOLLIN
1. recv(fd)
└ size > inbuf
- record offset, continue reading
2. add n request
- not sure why to init_conn though # line 304
3. loop if request does not meet max_request - 1
```
### kecho and cmwq
Document : [kernel.org](https://www.kernel.org/doc/html/v4.10/core-api/workqueue.html)
A workqueue is a convenient way to defer work to be executed in process context at a future time.
For each multithreaded workqueue, there is one worker thread per CPU (serving as the execution context).
When users create many workqueues on a machine with a large number of logical CPUs, the system can easily saturate the PID space with worker threads alone.
The level of concurrency provided by the normal multithreaded workqueue was unsatisfactory, despite hogging a lot of resources.
As the number of worker threads contending for the CPUs increases, unnecessary context switches happen; context switches are not free, so this degrades performance.
Having dedicated threads associated with each workqueue is another source of poor performance: when a work item blocks for some reason, the entire workqueue behind it has to wait for it.
#### CMWQ
CMWQ introduces a design that separates the user-facing workqueues from the backend mechanism that manages the worker pools.
All the user has to do is create a workqueue and queue work items onto it; the kernel manages the rest (such as concurrency).
In CMWQ, no threads are dedicated to any specific workqueue,
so work items submitted to the same workqueue may execute in different execution contexts, unless specified otherwise.
Each CPU has a shared pool of worker threads (as opposed to one worker thread per workqueue),
so far fewer kernel threads reside in the background system-wide (scaling: #cpus * *max_active* vs. #workqueues).
CMWQ also tracks whether workers are running or blocked:
queued work items wait for the currently running one, and a new worker starts only when there is no runnable worker left on the CPU, i.e., when the current worker blocks.
To quote the [documentation](https://www.kernel.org/doc/html/v4.10/core-api/workqueue.html#the-design),
> cmwq tries to keep the concurrency at a minimal but sufficient level
#### CMWQ API
One of the three goals of CMWQ is backwards compatibility, where they stated in the [documentation](https://www.kernel.org/doc/html/v4.10/core-api/workqueue.html#why-cmwq)
> Maintain compatibility with the original workqueue API.
And so, most of the functions were carried over from the original workqueue API; an example of its use can be seen in [LKMPG](https://sysprog21.github.io/lkmpg/#scheduling-tasks)
Workqueues were originally created with
```c
struct workqueue_struct *create_workqueue(name)
```
However, it has been deprecated, for reasons explained later on, and is replaced by
```c
struct workqueue_struct *alloc_workqueue(fmt, flags, max_active, args...)
```
The function allocates a workqueue configured by the given `WQ_*` flags.
A work item (not the workqueue itself) can either be declared at compile time with
```c
DECLARE_WORK(name, func)
```
or initialized at runtime with
```c
INIT_WORK(work_item, func)
PREPARE_WORK(work_item, func)
```
`INIT_WORK` initializes the linked list within the work item, whereas
`PREPARE_WORK` changes only the function pointer, and could be used even if the work item is already sitting in a workqueue (it has since been removed from the kernel).
To actually queue the work item, we use the function
```c
int queue_work(work_queue, work_item)
```
A work item can also be queued onto the global/default workqueue, when no dedicated workqueue is needed, using
```c
int schedule_work(work_item)
```
Note that both `queue_work` and `schedule_work` have delayed variants (`queue_delayed_work`, `schedule_delayed_work`), which guarantee a minimum delay in jiffies before the work is actually queued.
Then, to ensure every scheduled work has run to completion, use
```c
/* Deprecated */
void flush_scheduled_work()
/* Replacement, also deprecated */
bool flush_work_sync(work_item)
/* Replacement */
bool flush_work(work_item)
```
There is also a function, not listed in the [documentation](https://www.kernel.org/doc/html/v4.10/core-api/workqueue.html#application-programming-interface-api), that flushes and destroys a workqueue:
```c
void destroy_workqueue(work_queue)
```
#### flags
`WQ_UNBOUND`
Work items in this workqueue are not bound to a specific CPU.
The worker pools try to start execution as soon as possible, at the cost of locality.
`WQ_FREEZABLE`
As Stated in the [documentation](https://www.kernel.org/doc/html/v4.10/core-api/workqueue.html#application-programming-interface-api)
> A freezable wq participates in the freeze phase of the system suspend operations. Work items on the wq are drained and no new work item starts execution until thawed.
I've yet to think of a scenario where this could come in useful, though.
`WQ_MEM_RECLAIM`
This ensures the workqueue has at least one rescuer thread, reserved for execution when the system is tightly constrained on resources.
Say an existing worker is waiting on another work item that cannot start because no new worker thread can be created under memory pressure; the queue would be jammed in that state forever.
The rescuer thread is designed for this situation: it steps in to execute the pending work item and let the workqueue make forward progress.
`WQ_HIGHPRI`
There are two worker pools bound to each CPU, one with higher priority than the other.
This flag queues work items to the high-priority pool, so the kernel scheduler gives their worker threads higher priority.
`WQ_CPU_INTENSIVE`
Used for work items that should not contribute to the concurrency level.
Such work items may hog the CPU all they want; their execution simply does not prevent other work items in the pool from starting.
#### CMWQ in kecho
```c
struct workqueue_struct *kecho_wq;
/* init module */
kecho_wq = alloc_workqueue(MODULE_NAME, bench ? 0 : WQ_UNBOUND, 0);
```
Although it is not stated what happens when the flags are `0`, from the [implementation](https://github.com/torvalds/linux/blob/master/kernel/workqueue.c)
I assume that `flags = 0` bypasses all the setup for the `WQ_UNBOUND` flag; `WQ_UNBOUND` itself sacrifices locality, as a work item may hop between multiple CPUs.
`flags = (WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM | WQ_HIGHPRI)` yields similar results when benchmarking.
Whereas with `flags = WQ_UNBOUND`, the results became fairly inconsistent, as the upper bound of the runtime increased.
```c
/* echo_server_daemon() */
INIT_LIST_HEAD(&daemon.worker);
/* ... */
work = create_work(sock);
/* ... */
queue_work(kecho_wq, work);
/* create_work() */
INIT_WORK(&work->kecho_work, echo_server_worker);
list_add(&work->list, &daemon.worker);
return &work->kecho_work;
```
kecho maintains a list of workers, then queues the work onto the workqueue.
The worker is expected to be retrieved with `container_of` (which is built on `offsetof`).
```c
/* free_work */
list_for_each_entry_safe(..., &daemon.worker, list)
flush_work(&_->kecho_work);
/* kecho_cleanup_module */
destroy_workqueue(kecho_wq);
```
The workqueue is then destroyed in the module's cleanup function.
#### create_*workqueue deprecated
> cite : [working on workqueues](https://lwn.net/Articles/403891/)
> In kernels prior to `2.6.36`, workqueues are created with `create_workqueue()` and a couple of variants. That function will, among other things, start up one or more kernel threads to handle tasks submitted to that workqueue. In `2.6.36`, that interface has been preserved, but the workqueue it creates is a different beast: it has no dedicated threads and really just serves as a context for the submission of tasks. The API is considered deprecated; the proper way to create a workqueue now is with:
> ```c
> int alloc_workqueue(char *name, unsigned int flags, int max_active);
> ```
[commit d320c03830](https://github.com/torvalds/linux/commit/d320c03830b17af64e4547075003b1eeb274bc6c)
> This patch makes changes to make new workqueue features available to its users.
> * Now that workqueue is more featureful, there should be a public workqueue creation function which takes paramters to control them. Rename `__create_workqueue()` to `alloc_workqueue()` and make 0 `max_active` mean `WQ_DFL_ACTIVE`. In the long run, all `create_workqueue_*()` will be converted over to `alloc_workqueue()`.
prototype of `create_workqueue`
```c
struct workqueue_struct *create_workqueue(const char *name);
```
We can see exactly why `create_workqueue` was remade: in versions right before `2.6.36`, such as `2.6.15`: [commit 9fe6206f40](https://github.com/torvalds/linux/blob/9fe6206f400646a2322096b56c59891d530e8d51/include/linux/workqueue.h), macros are used to represent each flag instead of passing arguments.
```c
#define create_workqueue(name) __create_workqueue((name), 0, 0, 0)
#define create_rt_workqueue(name) __create_workqueue((name), 0, 0, 1)
#define create_freezeable_workqueue(name) __create_workqueue((name), 1, 1, 0)
#define create_singlethread_workqueue(name) __create_workqueue((name), 1, 0, 0)
```
As we can see, the trailing arguments are the bit patterns that the flags would otherwise encode, which limits the extensibility of the workqueue implementation.
As of the [latest commit](https://github.com/torvalds/linux/blob/55df0933be74bd2e52aba0b67eb743ae0feabe7e/include/linux/workqueue.h), there are about 11 flags to represent, so the previous coding style would make the code meaninglessly complicated.
Moreover, some operations are simply impossible in the previous coding style, such as `flags = (WQ_MEM_RECLAIM | WQ_CPU_INTENSIVE)`: an additional macro would be needed for every combination.
```c
/* Example following previous coding style */
#define create_cpu_intensive_workqueue(name) __create_workqueue((name), 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0)
#define create_cpu_intensive_and_mem_reclaim_workqueue(name) __create_workqueue((name), 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0)
```
### user-echo-server
Before looking at the implementation there is one point to take notice, there aren't any threading library in the user-echo-server such as `pthreads` or `omp`, nor it has system calls like `fork` or `clone`, meaning the user-echo-server is running only a single execution context.
Which would be the advantage of using I/O Multiplexer instead of non-blocking or blocking I/O,
the server does not need to hangs the program like blocking I/O ,nor
waste CPU time to repeatedly check for request like non-blocking I/O.
#### code
The server is first set up with the usual procedure
```c
socket -> bind -> listen
```
Then an epoll structure is created for the server as listener
```c
epoll_fd = epoll_create(EPOLL_SIZE)
epoll_ctl(epoll_fd, EPOLL_CTL_ADD, listener, &ev)
/* ev = { .events = EPOLLIN | EPOLLET } */
```
Then it waits for events
```c
epoll_wait(epoll_fd, events, EPOLL_SIZE, EPOLL_RUN_TIMEOUT)
```
There would be two fd to branch to
1. Listener (server)
2. Requestor (client)
```c
if (events[i].data.fd == listener){
/* listener, accept new connection */
}
else {
/* requestor, handles request by client */
}
```
Listener is set to be nonblocking
```c
setnonblock(listener)
```
if not, the program would block inside the loop that accepts new connections
```c
while((client = accept(listener, ...)) > 0) {
/* ... */
epoll_ctl(epoll_fd, EPOLL_CTL_ADD, client, &ev);
/* add client event to epoll */
}
```
As stated in `man 2 accept`
> If no pending connections are present on the queue, and the socket is not marked as nonblocking, `accept()` blocks the caller until a connection is present. If the socket is marked nonblocking and no pending connections are present on the queue, `accept()` fails with the error `EAGAIN` or `EWOULDBLOCK`.
Each of the Client is also set to be nonblocking
```c
setnonblock(client)
```
Currently there is no observable difference if we leave the client blocking.
My first guess is that this will matter for the send/recv operations when handling varying buffer sizes, which is to be done later.
`man 2 recv`
> If no messages are available at the socket, the receive calls wait for a message to arrive, unless the socket is nonblocking (see `fcntl`(2)), in which case the value -1 is returned and errno is set to `EAGAIN` or `EWOULDBLOCK`.
`man 2 send`
> When the message does not fit into the send buffer of the socket, `send()` normally blocks, unless the socket has been placed in nonblocking I/O mode. In nonblocking mode it would fail with the error `EAGAIN` or `EWOULDBLOCK` in this case.
### benchmark kecho
:::info
With Linux's default scheduling policy `SCHED_OTHER`, isolated CPUs are excluded from normal scheduling, so by default only one logical CPU core would run the benchmark.
A real-time scheduling policy such as `SCHED_FIFO` is used instead to measure the time taken by `bench.c` when CPU affinity is set to the isolated CPUs.
`$ sudo chrt -f 99 taskset 0xF ./bench && make plot`
:::
### Import CMWQ (replacing original kthread-based workqueue)
> Note : Code stolen shamelessly from [foxhoundsk](https://github.com/sysprog21/kecho/pull/1)
Commit [c8d604d](https://github.com/cantfindagoodname/khttpd/commit/c8d604da228fd96707ea7855995fcc6ff2cdc391)
(tested in `Linux 5.13`, `Gen11 Intel Core i9-11900`)
Original result of benchmarking
```
requests: 100000
good requests: 100000 [100%]
bad requests: 0 [0%]
socket errors: 0 [0%]
seconds: 1.043
requests/sec: 95845.206
```
Result after implementing CMWQ
```
requests: 100000
good requests: 100000 [100%]
bad requests: 0 [0%]
socket errors: 0 [0%]
seconds: 0.553
requests/sec: 180796.844
```
The pull request claims a 10x improvement with the CMWQ daemon for [kecho](https://github.com/sysprog21/kecho); however, it yields only about a 2x improvement for [khttpd](https://github.com/sysprog21/khttpd).
Regardless, it is a drastic improvement.
### khttpd Directory Listing
How the in-kernel web-server works before the changes:
```
$ wget localhost:1999
--2022-06-09 06:01:00-- http://localhost:1999/
Resolving localhost (localhost)... 127.0.0.1
Connecting to localhost (localhost)|127.0.0.1|:1999... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12 [text/plain]
Saving to: ‘index.html’
index.html 100%[============================>] 12 --.-KB/s in 0s
2022-06-09 06:01:00 (1.39 MB/s) - ‘index.html’ saved [12/12]
$ cat index.html
Hello World!
```
Check where `Hello World!` comes from:
```
$ grep -r "Hello World!"
http_server.c: "Connection: Close" CRLF CRLF "Hello World!" CRLF
http_server.c: "Connection: Keep-Alive" CRLF CRLF "Hello World!" CRLF
```
TODO
Summary for requirements
- khttpd directory listing
- kecho recv buffer overflow
requirements
- kecho rmmod + close browser page fault
- benchmark kecho (kthread, CMWQ)
- kecho, user-echo-server comparison
- performance analysis (select, poll, epoll)
- analyze drop-tcp-socket