ESCA

Tasks

Support Nginx v1.22.0

Benchmark tool: wrk

wrk is an open source benchmark tool for http servers.
The epoll fd of an event loop is reached through eventLoop->apidata->epfd.
Each thread handles thread->connections connections, which equals cfg.connections / cfg.threads.
thread->cs is an array of thread->connections connection objects; each one holds a connected fd and other relevant state.
Each thread owns one aeEventLoop with a setsize of 10 + cfg.connections * 3.

I don't know why setsize is 10 + cfg.connections * 3. I believe something like 30 + cfg.connections should be sufficient, because maxfd will not be much larger than cfg.connections.

aeCreateTimeEvent creates an aeTimeEvent and adds it to the head of the linked list eventLoop->timeEventHead.
connect_socket uses aeCreateFileEvent to add the fd to the epoll instance.

Support Nginx v1.22.0

commit

To upgrade nginx, I first need a brief understanding of how ESCA is applied to v1.20.0.
The Makefile shows that nginx is modified by a *.patch file (generated by diff) that is applied with patch.
The patch surrounds the polling section of the server with batch_start() and batch_flush(), and adds a call to esca_init(), which initializes the shared table.
Next, nginx.sh modifies nginx's Makefile, adding the -Wl,-E flag and the library path, so that these functions link against shim.so, whose versions do nothing and return 0. That means the vanilla nginx can still be launched normally.
Read through a brief sed tutorial.
When make nginx-esca-launch is called, LD_PRELOAD links the functions to wrapper.so instead, which performs the ESCA system calls.

Kernel Module

Define new system calls sys_register and sys_batch.
Derive the address of the system call table from the address of system_wq.

scTab = (void **) (smSCTab + ((char *) &system_wq - smSysWQ));

Replace entries 183 and 184 of the syscall table with sys_register and sys_batch.

DMA with `get_user_pages`

get_user_pages is defined in linux/include/linux/mm.h.

/**
 * get_user_pages() - pin user pages in memory
 * @start:      starting user address
 * @nr_pages:   number of pages from start to pin
 * @gup_flags:  flags modifying lookup behaviour
 * @pages:      array that receives pointers to the pages pinned.
 *              Should be at least nr_pages long. Or NULL, if caller
 *              only intends to ensure the pages are faulted in.
 * @vmas:       array of pointers to vmas corresponding to each page.
 *              Or NULL if the caller does not require them.
 *
 * This is the same as get_user_pages_remote(), just with a less-flexible
 * calling convention where we assume that the mm being operated on belongs to
 * the current task, and doesn't allow passing of a locked parameter.  We also
 * obviously don't pass FOLL_REMOTE in here.
 */
long get_user_pages(unsigned long start, unsigned long nr_pages,
			    unsigned int gup_flags, struct page **pages,
			    struct vm_area_struct **vmas);

pt_regs is defined in linux/arch/x86/include/asm/ptrace.h, where each field in the structure corresponds to a register.

struct pt_regs {
	unsigned long r15;
	unsigned long r14;
	unsigned long r13;
	unsigned long r12;
	unsigned long bp;
	unsigned long bx;
/* arguments: non interrupts/non tracing syscalls only save up to here*/
	unsigned long r11;
	unsigned long r10;
	unsigned long r9;
	unsigned long r8;
	unsigned long ax;
	unsigned long cx;
	unsigned long dx;
	unsigned long si;
	unsigned long di;
	unsigned long orig_ax;
/* end of arguments */
/* cpu exception frame or undefined */
	unsigned long ip;
	unsigned long cs;
	unsigned long flags;
	unsigned long sp;
	unsigned long ss;
/* top of stack page */
};

In wrapper.c, esca_init() allocates page-aligned memory and passes its beginning address, btable, to the syscall.

long esca_init()
{
    btable = aligned_alloc(pgsize, pgsize * MAX_THREAD_NUM);
    syscall(__NR_register, btable);
    return 0;
}

The kernel then obtains the value of btable in esca.c through struct pt_regs *: btable in the code above has the same value as p1 in the code below.

asmlinkage long sys_register(const struct pt_regs *regs)
{
    int n_page, i, j;
    unsigned long p1 = regs->di;

    /* map batch table from user-space to kernel */
    n_page = get_user_pages(
        (p1),           /* Start address to map */
        MAX_THREAD_NUM, /* Number of pages to pin; 4096 bytes each on this machine */
        FOLL_FORCE | FOLL_WRITE, /* Force flag */
        pinned_pages,            /* struct page ** pointer to pinned pages */
        NULL);
    ...
}

I don't know how btable, passed as an argument in user mode, ends up as a field of *regs.

Also, a pointer in user mode is 64 bits wide and the address uses 48 of them, so I don't know why the code is correct with p1 being 32 bits. I can only see that p1 equals the 32 least significant bits of btable.

System Call Table

Read through lkmpg: system calls.
The Linux kernel handles a system call by indexing sys_call_table, a static array that holds the address of the function to call for each system call number.

We can add our own system call by writing a new function, changing a pointer in sys_call_table to point to it in mod_init, and restoring the original pointer in mod_clear.

So now the question is how to find the address of sys_call_table. This is a little tricky because the kernel does not export the symbol, so a module cannot reference it directly.
/boot/System.map-$(uname -r) stores the symbols' addresses. These addresses are fixed and are the same every time the system reboots.
At boot, an offset is added to these addresses, and the results, which are the addresses the kernel actually uses, are listed in /proc/kallsyms. The offset is different every time the system reboots.

Because we have no direct access to the address of sys_call_table and the address changes after every reboot, we have to apply a trick.

system_wq is the symbol of a predefined workqueue in the kernel, and a kernel module can access it. Using it, we can calculate the offset between System.map and kallsyms, then read the address of sys_call_table from System.map and add the offset to obtain its run-time address.

scTab = (void **) (smSCTab + ((char *) &system_wq - smSysWQ));

The line above is taken from esca.c and computes the address of sys_call_table; smSCTab and smSysWQ appear to be the System.map addresses of sys_call_table and system_wq.

$ sudo grep -w sys_call_table /proc/kallsyms
ffffffff8fa00300 R sys_call_table
$ sudo grep -w sys_call_table /boot/System.map-`uname -r`
ffffffff82200300 R sys_call_table

$ sudo grep -w system_wq /boot/System.map-`uname -r`
ffffffff83162218 D system_wq
$ sudo grep -w system_wq /proc/kallsyms
ffffffff90962218 D system_wq

In the above example, the offset is 0xD800000.

ffffffff8fa00300 - ffffffff82200300 = D800000
ffffffff90962218 - ffffffff83162218 = D800000

LWAN web server

This commit is the first attempt to support the LWAN web server.
The changes include:
Add a patch for lwan that calls esca_init() before the main loop and places batch_start() and batch_flush() around the for loop that follows epoll_wait(), so that the system calls issued while handling the ready fds are batched.

int
main(int argc, char *argv[])
{
    struct lwan l;
    struct lwan_config c;
    struct lwan_straitjacket sj = {};
    char root_buf[PATH_MAX];
    char *root = root_buf;
    int ret = EXIT_SUCCESS;

    if (!getcwd(root, PATH_MAX))
        return 1;

    c = *lwan_get_default_config();
    c.listener = strdup("*:8080");
+   esca_init();
    switch (parse_args(argc, argv, &c, root, &sj)) {
        // initialize lwan
    }

    lwan_main_loop(&l);
    lwan_shutdown(&l);

}

for (;;) {
        int timeout = turn_timer_wheel(&tq, t, epoll_fd);
        int n_fds = epoll_wait(epoll_fd, events, max_events, timeout);
        bool created_coros = false;

        // some error handler
        
+       batch_start();
        for (struct epoll_event *event = events; n_fds--; event++) {
            // do something
        }

+       batch_flush();
    }

Add a script that adds the library to CMakeLists.txt:
The shell script uses sed to insert the following lines, which link the functions;
cmake then generates the makefiles.

add_library(libshim SHARED IMPORTED GLOBAL)
set_target_properties(libshim PROPERTIES IMPORTED_LOCATION ${libpath})
list(APPEND ADDITIONAL_LIBRARIES libshim)

When there is only one connection from the browser, only send() is called. When there are multiple connections, send() is called once and then sendfile() is called.
Therefore, we need to wrap both send() and sendfile() in the wrapper. Also, code written as sendfile() actually calls sendfile64(). I first attributed this to compiler optimization, but a more likely reason is glibc's large-file support, which redirects sendfile() to sendfile64() when _FILE_OFFSET_BITS=64 is defined.

Test by wrk:

$ downloads/wrk-master/wrk -c 50 -d 5s -t 4 http://localhost:8080/a20.html
Running 5s test @ http://localhost:8080/a20.html
  4 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   522.10us  238.06us   4.95ms   87.88%
    Req/Sec    23.43k     5.59k   36.21k    61.27%
  475312 requests in 5.10s, 9.35GB read
Requests/sec:  93206.67
Transfer/sec:      1.83GB

$ downloads/wrk-master/wrk -c 50 -d 5s -t 4 http://localhost:8080/a20.html
Running 5s test @ http://localhost:8080/a20.html
  4 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   355.72us  130.89us   4.57ms   88.79%
    Req/Sec    33.69k     7.04k   86.13k    85.64%
  676909 requests in 5.10s, 13.31GB read
Requests/sec: 132739.46
Transfer/sec:      2.61GB

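With ESCA (the faster of the two runs above), throughput rises from 93206.67 to 132739.46 requests/sec, about 42% higher, and average latency drops from 522.10us to 355.72us, about 32% lower.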

Support multi-threaded server

Nginx

nginx uses fork to create multiple worker processes and uses ngx_worker to give each process an id. For example, with 4 processes, ngx_worker is 0, 1, 2, and 3 respectively. Using this number, batch_start and batch_flush can identify each process and use a different table (or a different part of one table) to avoid data races.

eecheng

2022/06/07 23:28:46

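sys_register is a system call added by ESCA, so the mechanism can be thought of as that of an ordinary Linux system call. Since 2018, system call arguments have been wrapped in pt_regs for the kernel to use. The wrapper in user space (e.g. glibc) places the arguments in registers, and the system call entry code pushes all the registers onto the stack, which may be implemented roughly as follows: pushq %r11 pushq $__USER_CS pushq %rcx pushq %rax pushq %rdi pushq %rsi pushq %rdx pushq %rcx pushq $-ENOSYS pushq %r8 pushq %r9 pushq %r10 pushq %r11 sub $(6*8), %rsp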

2022/06/08 00:06:11

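The conversion to unsigned long here is to match the function prototype of get_user_pages; to find the underlying reason, you would probably need to trace that function. PS: For why Linux represents an address as unsigned long rather than void *, see the last paragraph on the second page of the following article: https://static.lwn.net/images/pdf/LDD3/ch11.pdf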
