# [ESCA](https://eecheng87.github.io/ESCA/)

## Tasks
- [x] Support Nginx v1.22.0

## Benchmark tool: wrk
[wrk](https://github.com/wg/wrk) is an open source benchmark tool for HTTP servers.

`eventLoop->apidata->epfd` is the epoll file descriptor of an event loop.

Each thread handles `thread->connections` connections, which is equal to `cfg.connections / cfg.threads`.

`thread->cs` is an array of `thread->connections` connections; each entry holds a connected fd and other relevant information.

Each thread contains one `aeEventLoop` with a `setsize` of `10 + cfg.connections * 3`.

:::info
I don't know why `setsize` is `10 + cfg.connections * 3`. I believe that something like `30 + cfg.connections` should be sufficient, because `maxfd` will not be much larger than `cfg.connections`.
:::

`aeCreateTimeEvent` creates an `aeTimeEvent` and adds it to the head of the linked list `eventLoop->timeEventHead`.

`connect_socket` uses `aeCreateFileEvent` to add the fd to the epoll instance.

## Support Nginx v1.22.0
==[commit](https://github.com/yaohwang99/ESCA/commit/bfc0f27a5129a4a981540afcd69c0577d1a15676)==

To upgrade the version of nginx, I first need a brief understanding of how ESCA is applied to v1.20.0.

The Makefile shows that nginx is first modified by `*.patch` (generated by [diff](https://man7.org/linux/man-pages/man1/diff.1.html)) using [patch](https://man7.org/linux/man-pages/man1/patch.1.html). The patch inserts `batch_start()` and `batch_flush()` around the polling section of the server and adds a call to `esca_init()`, which initializes the shared table.

Next, nginx is modified by `nginx.sh`, which links the above functions to `shim.so` by adding the `-Wl,-E` flag and the path to the nginx `Makefile`. The functions in `shim.so` do nothing and return 0, which means we can still launch the vanilla nginx normally. The script relies on `sed`; read through this [brief sed tutorial](https://www.tutorialspoint.com/sed/index.htm).

When `make nginx-esca-launch` is called, we use `LD_PRELOAD` to bind the functions to `wrapper.so` instead, which performs the ESCA system calls.
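As a rough illustration of the shim described above, a no-op library could look like the sketch below. The signatures (no arguments, `long` return value) are assumptions made for illustration and are not copied from the ESCA source, so the actual shim may differ.

```c
/* shim.c: hypothetical no-op stand-ins for the ESCA entry points.
 * Assumption: each function takes no arguments and returns long. */

/* Would set up the shared table in the real wrapper; here it does nothing. */
long esca_init(void)
{
    return 0;
}

/* Would mark the start of a batch in the real wrapper; here it does nothing. */
long batch_start(void)
{
    return 0;
}

/* Would submit the batched system calls in the real wrapper; here it does nothing. */
long batch_flush(void)
{
    return 0;
}
```

Built as a shared object (for example with `gcc -shared -fPIC shim.c -o shim.so`), such a library satisfies the linker. When another library defining the same symbols is listed in `LD_PRELOAD`, the dynamic loader resolves the calls to that library first, so the patched binary does not need to be rebuilt to switch between the two behaviours.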

## Kernel Module
1. Define new system calls `sys_register` and `sys_batch`.
2. Derive the address of the system call table from the address of `system_wq`.
   ```c
   scTab = (void **) (smSCTab + ((char *) &system_wq - smSysWQ));
   ```
3. Replace entries 183 and 184 of the system call table with `sys_register` and `sys_batch`.

### DMA with `get_user_pages`
`get_user_pages` is declared in [linux/include/linux/mm.h](https://github.com/torvalds/linux/blob/master/include/linux/mm.h).

```c
/**
 * get_user_pages() - pin user pages in memory
 * @start:      starting user address
 * @nr_pages:   number of pages from start to pin
 * @gup_flags:  flags modifying lookup behaviour
 * @pages:      array that receives pointers to the pages pinned.
 *              Should be at least nr_pages long. Or NULL, if caller
 *              only intends to ensure the pages are faulted in.
 * @vmas:       array of pointers to vmas corresponding to each page.
 *              Or NULL if the caller does not require them.
 *
 * This is the same as get_user_pages_remote(), just with a less-flexible
 * calling convention where we assume that the mm being operated on belongs to
 * the current task, and doesn't allow passing of a locked parameter.  We also
 * obviously don't pass FOLL_REMOTE in here.
 */
long get_user_pages(unsigned long start, unsigned long nr_pages,
                    unsigned int gup_flags, struct page **pages,
                    struct vm_area_struct **vmas);
```

`pt_regs` is defined in [linux/arch/x86/include/asm/ptrace.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/ptrace.h), where each field in the structure corresponds to a register.

```c
struct pt_regs {
    unsigned long r15;
    unsigned long r14;
    unsigned long r13;
    unsigned long r12;
    unsigned long bp;
    unsigned long bx;
    /* arguments: non interrupts/non tracing syscalls only save up to here */
    unsigned long r11;
    unsigned long r10;
    unsigned long r9;
    unsigned long r8;
    unsigned long ax;
    unsigned long cx;
    unsigned long dx;
    unsigned long si;
    unsigned long di;
    unsigned long orig_ax;
    /* end of arguments */
    /* cpu exception frame or undefined */
    unsigned long ip;
    unsigned long cs;
    unsigned long flags;
    unsigned long sp;
    unsigned long ss;
    /* top of stack page */
};
```

In [`wrapper.c`](https://github.com/eecheng87/ESCA/blob/main/wrapper/wrapper.c), `esca_init()` allocates new memory and passes its beginning address `btable` to `syscall`.

```c
long esca_init()
{
    /* one page per thread, aligned to the page size */
    btable = aligned_alloc(pgsize, pgsize * MAX_THREAD_NUM);
    syscall(__NR_register, btable);
    return 0;
}
```

The value of `btable` is then read in kernel mode in [esca.c](https://github.com/eecheng87/ESCA/blob/main/lkm/esca.c) through the `struct pt_regs *` argument. ==`btable` in the above code is the same as `p1` in the following code.==

```c
asmlinkage long sys_register(const struct pt_regs *regs)
{
    int n_page, i, j;
    unsigned long p1 = regs->di;

    /* map batch table from user-space to kernel */
    n_page = get_user_pages(
        (p1),                    /* Start address to map */
        MAX_THREAD_NUM,          /* Number of pages to pin; a page is 4096 bytes on this machine */
        FOLL_FORCE | FOLL_WRITE, /* Force flag */
        pinned_pages,            /* struct page ** pointer to pinned pages */
        NULL);
    ...
}
```

:::danger
:question:I don't know how `btable` in user mode is passed as an argument and ends up being a field of `*regs`.
:question:Also, a pointer in user mode is 64 bits wide and the address is 48 bits long, so I don't know why the code is correct with `p1` being 32 bits. I can only see that `p1` is the same as the 32 least significant bits of `btable`.
:::

On x86-64, the `syscall()` wrapper places the first system call argument in `rdi`. The kernel entry code saves the user registers into a `struct pt_regs` on the kernel stack and passes a pointer to it to the handler, so `regs->di` holds `btable`. `p1` is an `unsigned long`, which is 64 bits wide on x86-64 Linux, so the full address is preserved; seeing only the low 32 bits would be consistent with printing `p1` through a 32-bit format specifier such as `%x`.

### System Call Table
Read through [lkmpg: system calls](https://sysprog21.github.io/lkmpg/#system-calls).

The Linux kernel handles a system call by looking it up in `sys_call_table`, a static array containing the addresses of the functions to call. We can add our own system call by writing a new function, changing the pointer in `sys_call_table` to point to it in `mod_init`, and restoring the original pointer in `mod_clear`.

So now the question is how to find the address of `sys_call_table`. This is a little bit tricky because the symbol is protected by the kernel and cannot be accessed from a module.

`/boot/System.map-$(uname -r)` stores the addresses of the symbols as fixed at build time. These addresses are the ==same== every time the system reboots.

At boot, an offset is added to these addresses (kernel address space layout randomization), and the resulting runtime addresses, which are the ones the kernel actually uses, are listed in `/proc/kallsyms`. The offset is ==different== every time the system reboots.

Because we have no direct access to the address of `sys_call_table` and the address changes after every reboot, we have to apply a trick. `system_wq` is the symbol of a pre-defined workqueue in the kernel, and it can be accessed from a kernel module. Using it, we can calculate the offset between `System.map` and `kallsyms`, then read the address of `sys_call_table` from `System.map` and add the offset to obtain the runtime address.

```c
scTab = (void **) (smSCTab + ((char *) &system_wq - smSysWQ));
```

The line above is taken from esca.c and computes the address of `sys_call_table`.

```bash
$ sudo grep -w sys_call_table /proc/kallsyms
ffffffff8fa00300 R sys_call_table
$ sudo grep -w sys_call_table /boot/System.map-`uname -r`
ffffffff82200300 R sys_call_table
$ sudo grep -w system_wq /boot/System.map-`uname -r`
ffffffff83162218 D system_wq
$ sudo grep -w system_wq /proc/kallsyms
ffffffff90962218 D system_wq
```

In the above example, the offset is D800000.

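The same computation can be reproduced in user space with the addresses printed by `grep`. This is only a minimal sketch: the helper name `runtime_addr` and its parameter names are hypothetical and do not appear in esca.c.

```c
#include <stdint.h>

/* Hypothetical helper: derive the runtime address of a symbol from its
 * System.map address, using one reference symbol whose address is known
 * in both System.map and the running kernel. */
uint64_t runtime_addr(uint64_t sysmap_sym,   /* target symbol in System.map */
                      uint64_t sysmap_ref,   /* reference symbol in System.map */
                      uint64_t runtime_ref)  /* reference symbol at runtime */
{
    /* The boot-time offset is the same for both symbols. */
    uint64_t offset = runtime_ref - sysmap_ref;
    return sysmap_sym + offset;
}
```

With the values above, `runtime_addr(0xffffffff82200300, 0xffffffff83162218, 0xffffffff90962218)` evaluates to `0xffffffff8fa00300`, which matches the `sys_call_table` entry in `/proc/kallsyms`.
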
> ffffffff8fa00300 - ffffffff82200300 = D800000 (`sys_call_table`)
> ffffffff90962218 - ffffffff83162218 = D800000 (`system_wq`)

## LWAN web server
This [commit](https://github.com/yaohwang99/ESCA/commit/1643577b2150b21fa212a785e16c2ba398d89492) is the first attempt to support the LWAN web server. The changes include:

Add a patch for lwan that calls `esca_init()` before the main loop, and places `batch_start()` and `batch_flush()` around the for loop after `epoll_wait()` that handles the system calls issued for the ready fds.

```diff
int main(int argc, char *argv[])
{
    struct lwan l;
    struct lwan_config c;
    struct lwan_straitjacket sj = {};
    char root_buf[PATH_MAX];
    char *root = root_buf;
    int ret = EXIT_SUCCESS;

    if (!getcwd(root, PATH_MAX))
        return 1;

    c = *lwan_get_default_config();
    c.listener = strdup("*:8080");
+   esca_init();
    switch (parse_args(argc, argv, &c, root, &sj)) {
        // initialize lwan
    }
    lwan_main_loop(&l);
    lwan_shutdown(&l);
}
```

```diff
for (;;) {
    int timeout = turn_timer_wheel(&tq, t, epoll_fd);
    int n_fds = epoll_wait(epoll_fd, events, max_events, timeout);
    bool created_coros = false;
    // some error handler
+   batch_start();
    for (struct epoll_event *event = events; n_fds--; event++) {
        // do something
    }
+   batch_flush();
}
```

Add a script that adds the library to `CMakeLists.txt`: the shell script uses `sed` to insert the following code, which links the functions; cmake then generates the makefiles.

```cmake
add_library(libshim SHARED IMPORTED GLOBAL)
set_target_properties(libshim PROPERTIES IMPORTED_LOCATION ${libpath})
list(APPEND ADDITIONAL_LIBRARIES libshim)
```

When there is only one connection from the browser, only `send()` is called. When there are multiple connections, `send()` is called once and then `sendfile()` is called. Therefore, we need to handle both `send()` and `sendfile()` in the wrapper. Also, because of compiler optimization (or maybe some other reason), code that is written as `sendfile()` actually calls `sendfile64()`.

Test with `wrk`:

```shell
$ downloads/wrk-master/wrk -c 50 -d 5s -t 4 http://localhost:8080/a20.html
Running 5s test @ http://localhost:8080/a20.html
  4 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   522.10us  238.06us   4.95ms   87.88%
    Req/Sec    23.43k     5.59k   36.21k    61.27%
  475312 requests in 5.10s, 9.35GB read
Requests/sec:  93206.67
Transfer/sec:      1.83GB
$ downloads/wrk-master/wrk -c 50 -d 5s -t 4 http://localhost:8080/a20.html
Running 5s test @ http://localhost:8080/a20.html
  4 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   355.72us  130.89us   4.57ms   88.79%
    Req/Sec    33.69k     7.04k   86.13k    85.64%
  676909 requests in 5.10s, 13.31GB read
Requests/sec: 132739.46
Transfer/sec:      2.61GB
```

The run with ESCA (the second one) is faster: throughput rises from about 93.2k to about 132.7k requests/sec (roughly 42% higher), and average latency drops from 522 us to 356 us (roughly 32% lower).

## Support multi-threaded server
### Nginx
nginx uses `fork` to run multiple worker processes and uses `ngx_worker` to assign each process an id. For example, if there are 4 processes, the `ngx_worker` values of the processes are 0, 1, 2, and 3. Using this number, `batch_start` and `batch_flush` can identify each process and use a different table (or a different part of one table) to avoid data races.
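
One way to read "a different part of one table" is to give each worker its own page-sized slice of the shared table, indexed by its worker id. The sketch below assumes one page per worker, in line with the `pgsize * MAX_THREAD_NUM` allocation in `esca_init()`. The helper name `worker_table` and the constant values are hypothetical and chosen only for illustration.

```c
#include <stddef.h>

#define MAX_THREAD_NUM  8     /* assumed value, for illustration only */
#define PAGE_SIZE_BYTES 4096  /* assumed page size */

/* Hypothetical helper: return the slice of the shared batch table that is
 * reserved for the given worker, or NULL if the arguments are invalid. */
void *worker_table(void *btable, int worker_id)
{
    if (btable == NULL || worker_id < 0 || worker_id >= MAX_THREAD_NUM)
        return NULL;
    /* Worker i owns bytes [i * PAGE_SIZE_BYTES, (i + 1) * PAGE_SIZE_BYTES). */
    return (char *) btable + (size_t) worker_id * PAGE_SIZE_BYTES;
}
```

Because the slices do not overlap, workers 0 to 3 can fill their own entries without locking. A call such as `worker_table(btable, ngx_worker)` inside `batch_start` would be one way to select the slice, assuming the wrapper can read the worker id.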