# [ESCA](https://eecheng87.github.io/ESCA/)
## Tasks
- [x] Support Nginx v1.22.0
## Benchmark tool: wrk
[wrk](https://github.com/wg/wrk) is an open-source benchmarking tool for HTTP servers.
`eventLoop->apidata->epfd`
Each thread handles `thread->connections` connections, which equals `cfg.connections / cfg.threads`.
`thread->cs` is an array of `thread->connections` connection structures, each holding a connected fd and other relevant information.
Each thread contains one `aeEventLoop` with a `setsize` of `10 + cfg.connections * 3`.
:::info
I don't know why `setsize` is `10 + cfg.connections * 3`. I believe something like `30 + cfg.connections` should be sufficient, because `maxfd` will not be much larger than `cfg.connections`.
:::
`aeCreateTimeEvent` creates an `aeTimeEvent` and adds it to the head of the linked list `eventLoop->timeEventHead`.
`connect_socket` uses `aeCreateFileEvent` to add the fd to the epoll instance.
## Support Nginx v1.22.0
==[commit](https://github.com/yaohwang99/ESCA/commit/bfc0f27a5129a4a981540afcd69c0577d1a15676)==
To upgrade nginx, I first need a brief understanding of how ESCA was applied to v1.20.0.
Looking at the Makefile, we find that nginx is modified by `*.patch` files (generated by [diff](https://man7.org/linux/man-pages/man1/diff.1.html)) using [patch](https://man7.org/linux/man-pages/man1/patch.1.html).
The patch adds `batch_start()` and `batch_flush()` around the polling section of the server, and adds `esca_init()`, which initializes the shared table.
Next, nginx is modified by `nginx.sh`, which links the above functions to `shim.so` by adding the `-Wl,-E` flag and the library path to the `Makefile`. The functions in `shim.so` do nothing and return 0, so vanilla nginx can still be launched normally.
Read through this [brief sed tutorial](https://www.tutorialspoint.com/sed/index.htm).
When `make nginx-esca-launch` is called, `LD_PRELOAD` is used to link the functions to `wrapper.so` instead, which performs the ESCA system calls.
## Kernel Module
1. Define new system calls `sys_register` and `sys_batch`.
2. Derive the address of the system call table by the address of `system_wq`.
```c
scTab = (void **) (smSCTab + ((char *) &system_wq - smSysWQ));
```
3. Replace entries 183 and 184 of the system call table with `sys_register` and `sys_batch`.
### DMA with `get_user_pages`
`get_user_pages` is declared in [linux/include/linux/mm.h](https://github.com/torvalds/linux/blob/master/include/linux/mm.h).
```c
/**
* get_user_pages() - pin user pages in memory
* @start: starting user address
* @nr_pages: number of pages from start to pin
* @gup_flags: flags modifying lookup behaviour
* @pages: array that receives pointers to the pages pinned.
* Should be at least nr_pages long. Or NULL, if caller
* only intends to ensure the pages are faulted in.
* @vmas: array of pointers to vmas corresponding to each page.
* Or NULL if the caller does not require them.
*
* This is the same as get_user_pages_remote(), just with a less-flexible
* calling convention where we assume that the mm being operated on belongs to
* the current task, and doesn't allow passing of a locked parameter. We also
* obviously don't pass FOLL_REMOTE in here.
*/
long get_user_pages(unsigned long start, unsigned long nr_pages,
unsigned int gup_flags, struct page **pages,
struct vm_area_struct **vmas);
```
`pt_regs` is defined in [linux/arch/x86/include/asm/ptrace.h](https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/ptrace.h), where each field in the structure corresponds to a register.
```c
struct pt_regs {
unsigned long r15;
unsigned long r14;
unsigned long r13;
unsigned long r12;
unsigned long bp;
unsigned long bx;
/* arguments: non interrupts/non tracing syscalls only save up to here*/
unsigned long r11;
unsigned long r10;
unsigned long r9;
unsigned long r8;
unsigned long ax;
unsigned long cx;
unsigned long dx;
unsigned long si;
unsigned long di;
unsigned long orig_ax;
/* end of arguments */
/* cpu exception frame or undefined */
unsigned long ip;
unsigned long cs;
unsigned long flags;
unsigned long sp;
unsigned long ss;
/* top of stack page */
};
```
In [`wrapper.c`](https://github.com/eecheng87/ESCA/blob/main/wrapper/wrapper.c), `esca_init()` allocates new memory and passes its beginning address, `btable`, to `syscall`.
```c
long esca_init()
{
btable = aligned_alloc(pgsize, pgsize * MAX_THREAD_NUM);
syscall(__NR_register, btable);
return 0;
}
```
The value of `btable` is then read in kernel mode in [esca.c](https://github.com/eecheng87/ESCA/blob/main/lkm/esca.c) through the `struct pt_regs *` argument. ==`btable` in `esca_init()` is the same as `p1` in the following code.==
```c
asmlinkage long sys_register(const struct pt_regs *regs)
{
int n_page, i, j;
unsigned long p1 = regs->di;
/* map batch table from user-space to kernel */
n_page = get_user_pages(
(p1), /* Start address to map */
MAX_THREAD_NUM, /* Number of pages to pin. 4096 bytes per page on this machine */
FOLL_FORCE | FOLL_WRITE, /* Force flag */
pinned_pages, /* struct page ** pointer to pinned pages */
NULL);
...
}
```
:::danger
:question:I don't know how `btable` in user mode is passed as an argument and ends up being a field of `*regs`.
:question:Also, a pointer in user mode is 64 bits wide and the address is 48 bits long. I don't know why the code is correct with `p1` being 32 bits; I can only see that `p1` is the same as the 32 least significant bits of `btable`.
:::
### System Call Table
Read through [lkmpg: system calls](https://sysprog21.github.io/lkmpg/#system-calls).
The Linux kernel handles a system call by looking it up in `sys_call_table`, a static array containing the addresses of the functions to call.
We can add our own system call by writing a new function, changing the pointer in `sys_call_table` to point to it in `mod_init`, and restoring the original pointer in `mod_clear`.
So now the question is how to find the address of `sys_call_table`. This is a little tricky because the symbol is not exported by the kernel, so we cannot access it directly in the module.
`/boot/System.map-$(uname -r)` stores the symbols' addresses. These addresses are fixed and are the ==same== every time the system reboots.
At boot, an offset is added to each address, and the result, which is the address the kernel actually uses, is shown in `/proc/kallsyms`. The offset is ==different== every time the system reboots.
Because we don't have direct access to the address of `sys_call_table` and the address changes after every reboot, we have to apply a trick.
`system_wq` is a symbol for a predefined workqueue in the kernel that can be accessed from a kernel module. Using it, we can calculate the offset between `System.map` and `kallsyms`, then read the address of `sys_call_table` from `System.map` and add the offset to obtain the runtime address.
```c
scTab = (void **) (smSCTab + ((char *) &system_wq - smSysWQ));
```
The above line from esca.c computes the address of `sys_call_table`. Here `smSCTab` and `smSysWQ` are presumably the `System.map` addresses of `sys_call_table` and `system_wq`.
```bash
$ sudo grep -w sys_call_table /proc/kallsyms
ffffffff8fa00300 R sys_call_table
$ sudo grep -w sys_call_table /boot/System.map-`uname -r`
ffffffff82200300 R sys_call_table
$ sudo grep -w system_wq /boot/System.map-`uname -r`
ffffffff83162218 D system_wq
$ sudo grep -w system_wq /proc/kallsyms
ffffffff90962218 D system_wq
```
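In the above example, the offset is 0xD800000. It is the same for both symbols:
> ffffffff8fa00300 - ffffffff82200300 = D800000 (`sys_call_table`: kallsyms - System.map)
> ffffffff90962218 - ffffffff83162218 = D800000 (`system_wq`: kallsyms - System.map)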
## LWAN web server
This [commit](https://github.com/yaohwang99/ESCA/commit/1643577b2150b21fa212a785e16c2ba398d89492) is the first attempt to support the LWAN web server.
The changes include:
Add a patch for lwan that calls `esca_init()` before the main loop and places `batch_start()` and `batch_flush()` around the for loop that follows `epoll_wait()`, so that the system calls made while handling the ready fds are batched.
```diff
int
main(int argc, char *argv[])
{
struct lwan l;
struct lwan_config c;
struct lwan_straitjacket sj = {};
char root_buf[PATH_MAX];
char *root = root_buf;
int ret = EXIT_SUCCESS;
if (!getcwd(root, PATH_MAX))
return 1;
c = *lwan_get_default_config();
c.listener = strdup("*:8080");
+ esca_init();
switch (parse_args(argc, argv, &c, root, &sj)) {
// initialize lwan
}
lwan_main_loop(&l);
lwan_shutdown(&l);
}
```
```diff
for (;;) {
int timeout = turn_timer_wheel(&tq, t, epoll_fd);
int n_fds = epoll_wait(epoll_fd, events, max_events, timeout);
bool created_coros = false;
// some error handler
+ batch_start();
for (struct epoll_event *event = events; n_fds--; event++) {
// do something
}
+ batch_flush();
}
```
Add a script that adds the library to `CMakeLists.txt`:
The sh script uses `sed` to insert the following code, which links the functions.
cmake then generates the makefiles.
```cmake
add_library(libshim SHARED IMPORTED GLOBAL)
set_target_properties(libshim PROPERTIES IMPORTED_LOCATION ${libpath})
list(APPEND ADDITIONAL_LIBRARIES libshim)
```
When there is only one connection from the browser, only `send()` is called.
When there are multiple connections, `send()` is called once and then `sendfile()` is called.
Therefore, we need to handle both `send()` and `sendfile()` in the wrapper. Also, because of compiler optimization (or maybe some other reason), a call written as `sendfile()` actually ends up calling `sendfile64()`.
Test by `wrk`:
```shell
$ downloads/wrk-master/wrk -c 50 -d 5s -t 4 http://localhost:8080/a20.html
Running 5s test @ http://localhost:8080/a20.html
4 threads and 50 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 522.10us 238.06us 4.95ms 87.88%
Req/Sec 23.43k 5.59k 36.21k 61.27%
475312 requests in 5.10s, 9.35GB read
Requests/sec: 93206.67
Transfer/sec: 1.83GB
$ downloads/wrk-master/wrk -c 50 -d 5s -t 4 http://localhost:8080/a20.html
Running 5s test @ http://localhost:8080/a20.html
4 threads and 50 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 355.72us 130.89us 4.57ms 88.79%
Req/Sec 33.69k 7.04k 86.13k 85.64%
676909 requests in 5.10s, 13.31GB read
Requests/sec: 132739.46
Transfer/sec: 2.61GB
```
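The result with ESCA (the second run) shows about 32% lower average latency (522.10 us down to 355.72 us) and about 42% more requests per second (93.2k up to 132.7k).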
## Support multi-threaded server
### Nginx
nginx uses `fork` to create multiple worker processes and uses `ngx_worker` to assign each process an id. For example, if there are 4 processes, their `ngx_worker` values are 0, 1, 2, and 3. Using this number, `batch_start` and `batch_flush` can identify each process and use a different table (or a different part of one table) to avoid data races.