wrk is an open-source benchmarking tool for HTTP servers.
eventLoop->apidata->epfd
Each thread handles `thread->connections` connections, which is equal to `cfg.connections / cfg.threads`.
`thread->cs` is an array of `thread->connections` connections, each of which contains a connected fd and other relevant information.
Each thread contains one `aeEventLoop` with a `setsize` of `10 + cfg.connections * 3`.
I don't know why `setsize` is `10 + cfg.connections * 3`. I believe that something like `30 + cfg.connections` should be sufficient, because `maxfd` will not be much larger than `cfg.connections`.
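To make the sizes above concrete, here is a minimal sketch of the arithmetic. The function names are hypothetical and are not taken from wrk; only the formulas come from the notes above.

```c
#include <assert.h>
#include <stdint.h>

/* Connections handled by each thread: cfg.connections / cfg.threads. */
static uint64_t connections_per_thread(uint64_t connections, uint64_t threads) {
    assert(threads > 0);            /* avoid division by zero */
    return connections / threads;   /* integer division, remainder dropped */
}

/* setsize as described in the notes: 10 + cfg.connections * 3. */
static uint64_t setsize_current(uint64_t connections) {
    return 10 + connections * 3;
}

/* The smaller bound suggested in the notes: 30 + cfg.connections. */
static uint64_t setsize_proposed(uint64_t connections) {
    return 30 + connections;
}
```

With 400 connections and 4 threads, each thread handles 100 connections, the current `setsize` is 1210, and the proposed one would be 430.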
`aeCreateTimeEvent` creates an `aeTimeEvent` and adds it to the head of the linked list `eventLoop->timeEventHead`.
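The head insertion can be sketched as follows. The structs here are hypothetical, trimmed-down stand-ins for `aeTimeEvent` and `aeEventLoop` (the real ones carry more fields), so treat this as an illustration of the list operation rather than the actual ae code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for aeTimeEvent. */
typedef struct time_event {
    long long id;
    struct time_event *next;
} time_event;

/* Hypothetical stand-in for aeEventLoop. */
typedef struct event_loop {
    long long next_id;
    time_event *time_event_head;
} event_loop;

/* Allocate a time event and push it onto the head of the list.
 * Returns the new event's id, or -1 if allocation fails. */
static long long create_time_event(event_loop *loop) {
    assert(loop != NULL);
    time_event *te = malloc(sizeof(*te));
    if (te == NULL) return -1;
    te->id = loop->next_id++;
    te->next = loop->time_event_head;  /* old head becomes the second node */
    loop->time_event_head = te;        /* new event becomes the head */
    return te->id;
}
```

Because new events go to the head, the most recently created event is always the first one found when walking the list.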
`connect_socket` uses `aeCreateFileEvent` to add the fd to the epoll instance.
To upgrade nginx to v1.20.0, I first need a brief understanding of how ESCA is implemented.
Looking at the Makefile, we find that nginx is modified by applying `*.patch` (generated by `diff`) with `patch`.
The patch inserts `batch_start()` and `batch_flush()` around the polling section of the server and adds `esca_init()`, which initializes the shared table.
Later, nginx is modified by `nginx.sh`, which links the above functions to `shim.so`, whose versions do nothing and return 0, by adding the `-Wl,-E` flag and the path to the Makefile. That means we can still launch the vanilla nginx normally.
Read through a brief sed tutorial.
When `make nginx-esca-launch` is called, we use `LD_PRELOAD` to link the functions to `wrapper.so` instead, which performs the ESCA system calls.
`sys_register` and `sys_batch`.
`system_wq`.
`sys_register` and `sys_batch`.
`get_user_pages`.
`get_user_pages` is declared in linux/include/linux/mm.h.
`pt_regs` is defined in linux/arch/x86/include/asm/ptrace.h, where each field in the structure corresponds to a register.
In wrapper.c, `esca_init()` allocates new memory and passes its beginning address, `btable`, to `syscall`.
The value of `btable` is then retrieved in kernel mode in esca.c through `struct pt_regs *`. Here, `btable` in the above code is the same as `p1` in the following code.
`btable` in user mode is passed as an argument and ends up as a field of `*regs`. `sizeof` a pointer in user mode is 64 bits and the address is 48 bits long, so I don't know why the code is correct with `p1` being 32 bits. I can only see that `p1` is the same as the 32 least significant bits of `btable`.
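The observation about the 32 least significant bits can be reproduced in user space. This is only a sketch of the truncation itself, with hypothetical helper names; it does not explain why the esca.c code works.

```c
#include <assert.h>
#include <stdint.h>

/* Keep only the 32 least significant bits of a 64-bit address,
 * which is all that a 32-bit p1 could hold. */
static uint32_t low32(uint64_t addr) {
    assert((addr >> 48) == 0);   /* user-mode addresses fit in 48 bits */
    return (uint32_t)(addr & 0xffffffffu);
}

/* Non-zero when the 32-bit value cannot represent the full address. */
static int loses_bits(uint64_t addr) {
    return (uint64_t)low32(addr) != addr;
}
```

For an address such as `0x00007f1234567890`, `low32` yields `0x34567890` and `loses_bits` reports that the upper bits were dropped, so a 32-bit value is only safe if the allocation happens to sit below 4 GiB.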
Read through lkmpg: system calls.
The Linux kernel handles a system call by looking it up in `sys_call_table`, which is a static array containing the addresses of the functions to call.
We can add our own system call by writing a new function, changing the pointer in `sys_call_table` to point to our function in `mod_init`, and restoring the original function in `mod_clear`.
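The save, replace, and restore pattern can be tried out in user space with a mock table. Everything below is hypothetical (the table, its size, the slot number, and the two functions); a real module would operate on the kernel's `sys_call_table` and would also have to deal with write protection, which this sketch ignores.

```c
#include <assert.h>
#include <stddef.h>

typedef long (*syscall_fn)(long);

#define TABLE_SIZE 4   /* hypothetical size of the mock table */
#define HOOKED_NR  2   /* hypothetical slot we take over */

static long original_call(long x) { return x; }
static long our_call(long x)      { return x + 1000; }

/* Mock stand-in for sys_call_table; the real one lives in the kernel. */
static syscall_fn mock_table[TABLE_SIZE] = {
    original_call, original_call, original_call, original_call
};
static syscall_fn saved_entry;  /* pointer to put back on unload */

/* What mod_init would do: remember the old entry, then install ours. */
static void mock_mod_init(void) {
    saved_entry = mock_table[HOOKED_NR];
    mock_table[HOOKED_NR] = our_call;
}

/* What mod_clear would do: restore the original entry. */
static void mock_mod_clear(void) {
    assert(saved_entry != NULL);  /* mock_mod_init must have run first */
    mock_table[HOOKED_NR] = saved_entry;
}
```

Only the hooked slot changes between `mock_mod_init()` and `mock_mod_clear()`; every other entry keeps pointing at the original function.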
So now the question is how to find the address of `sys_call_table`. This is a little tricky because the symbol is protected by the kernel, so we cannot access it from the module.
`/boot/System.map-$(uname -r)` stores the symbols' addresses. These addresses are fixed and stay the same every time the system reboots.
An offset is then added to each address, and the result is stored in `/proc/kallsyms`; this is the address the kernel actually uses. The offset is different every time the system reboots.
Because we don't have direct access to the address of `sys_call_table`, and the address changes after every reboot, we have to apply some tricks.
`system_wq` is a symbol for a pre-defined workqueue in the kernel, and it can be accessed from a kernel module. Using it, we can calculate the offset from `System.map` to `kallsyms`. We then read the address of `sys_call_table` in `System.map` and add the offset to obtain the actual address.
The above code is a single line in esca.c that finds the address of `sys_call_table`.
In the above example, the offset is D800000.
ffffffff8fa00300 - ffffffff82200300 = D800000
ffffffff90962218 - ffffffff83162218 = D800000
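The calculation can be sketched in plain C using the four addresses from the example, pairing each run-time address with the `System.map` address that differs from it by exactly D800000. The function names are hypothetical, and which pair belongs to `system_wq` and which to `sys_call_table` is an assumption here.

```c
#include <assert.h>
#include <stdint.h>

/* Offset between the address used at run time (as in /proc/kallsyms)
 * and the fixed address listed in System.map for the same symbol. */
static uint64_t boot_offset(uint64_t runtime_addr, uint64_t sysmap_addr) {
    assert(runtime_addr >= sysmap_addr);
    return runtime_addr - sysmap_addr;
}

/* Apply the offset to another symbol's System.map address. */
static uint64_t relocate(uint64_t sysmap_addr, uint64_t offset) {
    return sysmap_addr + offset;
}
```

Taking `0xffffffff90962218` and `0xffffffff83162218` as the known symbol's two addresses gives an offset of `0xD800000`; adding that to `0xffffffff82200300` yields `0xffffffff8fa00300`.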
This commit is a first attempt at supporting the LWAN web server.
The changes include:
Add a patch for lwan that adds `esca_init()` before the main loop and places `batch_start()` and `batch_flush()` around the for loop after `epoll_wait()`, which handles the system calls made for the ready fds.
Add a script that adds the library to CMakeLists.txt:
`sed` is used in an sh script to add the following code, which links the functions; cmake then generates the makefiles.
When there is only one connection from the browser, only `send()` is called. When there are multiple connections, `send()` is called once and then `sendfile()` is called.
Therefore, we need to handle both `send()` and `sendfile()` in the wrapper. Also, because of compiler optimization (or maybe another reason, such as large-file support mapping the call to its 64-bit variant), code written as `sendfile()` actually calls `sendfile64()`.
Test with `wrk`:
The result with ESCA is about 30% faster.
nginx uses `fork` to run multiple processes and uses `ngx_worker` to assign each process an id. For example, if there are 4 processes, their `ngx_worker` values are 0, 1, 2, 3. Using this number, `batch_start` and `batch_flush` can identify each process and use a different table (or a different part of one table) to avoid data races.
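One way to picture the "different part of one table" option is to give every worker a fixed-size slice indexed by its id. This is only a sketch: the entry layout, the sizes, and the names `shared_table` and `worker_slice` are hypothetical and are not taken from the ESCA source.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WORKERS        4   /* hypothetical number of worker processes */
#define ENTRIES_PER_WORKER 64  /* hypothetical slice size per worker */

/* Hypothetical layout of one batched system call entry. */
typedef struct batch_entry {
    long sysnum;
    long args[6];
} batch_entry;

/* One table, logically split into MAX_WORKERS disjoint slices. */
static batch_entry shared_table[MAX_WORKERS * ENTRIES_PER_WORKER];

/* Return the start of the slice owned by the given worker id
 * (the value the notes call ngx_worker). */
static batch_entry *worker_slice(int worker_id) {
    assert(worker_id >= 0 && worker_id < MAX_WORKERS);
    return &shared_table[(size_t)worker_id * ENTRIES_PER_WORKER];
}
```

Since the slices never overlap, two workers writing entries at the same time touch different memory, which is what avoids the data race.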