---
tags: Research
---
# How ESCA Works

## ESCA operating principle

The block between `batch_start()` and `batch_flush()` is called a `batch segment`. ESCA prevents the system calls inside a `batch segment` from entering kernel mode and instead writes each call's system call number (system call ID) and its arguments into a shared table. After `batch_flush()` is called, ESCA executes all the system calls recorded in the shared table and then returns to user mode.

## system call flow

### Overview

![](https://i.imgur.com/HaK4uIJ.png)
![](https://i.imgur.com/zEXqLKH.png)

1. The user application invokes a system call.
2. A software interrupt (trap) switches the CPU from user mode to kernel mode.
3. The interrupt vector table is consulted to determine which interrupt service routine (ISR) to use.
4. That ISR indicates that the system call table must be looked up.
5. Once the matching system call service routine is found, the corresponding function is called and the service routine runs to completion.
6. After the system call service routine finishes, control passes to `ret_from_sys`, which checks for tasks that must be completed before switching back to user mode. If there is nothing left to do, it switches back to user mode.

### system call wrapper function

System calls are invoked through assembly code. Wrapping that assembly in a wrapper function improves readability and makes the arguments being passed explicit. The example below targets x86-64 and uses the `syscall` instruction.

Ex:
```c
/* Load up to six arguments into the x86-64 syscall registers, execute
 * `syscall`, and yield the value returned in rax. */
#define SYSCALL(name, a1, a2, a3, a4, a5, a6) \
    ({ \
        long result; \
        long __a1 = (long) (a1), __a2 = (long) (a2), __a3 = (long) (a3); \
        long __a4 = (long) (a4), __a5 = (long) (a5), __a6 = (long) (a6); \
        register long _a1 asm("rdi") = __a1; \
        register long _a2 asm("rsi") = __a2; \
        register long _a3 asm("rdx") = __a3; \
        register long _a4 asm("r10") = __a4; \
        register long _a5 asm("r8") = __a5; \
        register long _a6 asm("r9") = __a6; \
        asm volatile("syscall\n\t" \
                     : "=a"(result) \
                     : "0"(name), "r"(_a1), "r"(_a2), "r"(_a3), "r"(_a4), \
                       "r"(_a5), "r"(_a6) \
                     : "memory", "cc", "r11", "cx"); \
        (long) result; \
    })

/* Fixed-arity helpers: unused arguments are passed as 0. */
#define SYSCALL1(name, a1) SYSCALL(name, a1, 0, 0, 0, 0, 0)
#define SYSCALL2(name, a1, a2) SYSCALL(name, a1, a2, 0, 0, 0, 0)
#define SYSCALL3(name, a1, a2, a3) SYSCALL(name, a1, a2, a3, 0, 0, 0)
#define SYSCALL4(name, a1, a2, a3, a4) SYSCALL(name, a1, a2, a3, a4, 0, 0)
#define SYSCALL5(name, a1, a2, a3, a4, a5) SYSCALL(name, a1, a2, a3, a4, a5, 0)
#define SYSCALL6(name, a1, a2, a3, a4, a5, a6) \
    SYSCALL(name, a1, a2, a3, a4, a5, a6)

/* wrapper function */
static inline void *brk(void *addr)
{
    return (void *) SYSCALL1(__NR_brk, addr);
}
```

## How system calls are replaced (hook system call)

When a userspace application invokes a system call and switches to kernel mode, the kernel looks up the syscall table by syscall number, finds the corresponding syscall handler, and executes it. ESCA replaces syscall table entries so that the original syscall handlers are swapped for custom ones. Before a syscall table entry can be modified, the `write protection bit` must be cleared.

* Two system calls
    1. `sys_batch`: iterates over the shared table and executes each system call according to the syscall number and arguments in the table entry, then returns to user mode.
    2. `sys_register`: maps the userspace shared table into kernel space memory and initializes it.

### Replacing the system call handler

```c
// find out syscall table address
scTab = (void **) (smSCTab + ((char *) &system_wq - smSysWQ));

// clear write protection bit
allow_writes();

/* backup */
sys_oldcall0 = scTab[__NR_batch_flush];
sys_oldcall1 = scTab[__NR_register];

/* hooking */
scTab[__NR_batch_flush] = sys_batch;
scTab[__NR_register] = sys_register;

// set write protection bit
disallow_writes();
```

## How the kernel and user space see the same physical address space

`p1` is the virtual page address in userspace. The 10 physical pages (not necessarily contiguous) that back 10 contiguous virtual pages are requested and stored in `pinned_pages`. The for loop then maps each of these 10 physical pages into kernel memory and stores the resulting kernel addresses in `batch_table`, one entry per page.

* The `batch_register` syscall maps the userspace shared table into kernel space memory and initializes it.

```c
asmlinkage long sys_register(const struct pt_regs *regs)
{
    int n_page, i, j;
    unsigned long p1 = regs->di;

    /* map batch table from user-space to kernel */
    n_page = get_user_pages(
        (p1),                    /* Start address to map */
        MAX_THREAD_NUM,          /* Number of pinned pages. 4096 bytes in this
                                    machine */
        FOLL_FORCE | FOLL_WRITE, /* Force flag */
        pinned_pages,            /* struct page ** pointer to pinned pages */
        NULL);

    for (i = 0; i < MAX_THREAD_NUM; i++)
        batch_table[i] = (struct batch_entry *) kmap(pinned_pages[i]);

    /* initial table status */
    for (j = 0; j < MAX_THREAD_NUM; j++)
        for (i = 0; i < MAX_ENTRY_NUM; i++)
            batch_table[j][i].rstatus = BENTRY_EMPTY;

    global_i = global_j = 0;

    main_pid = current->pid;

    return 0;
}
```

### get_user_pages

Requests the physical pages that back the given userspace virtual pages. The returned page descriptors (`struct page *`) are stored in the `struct page **pages` argument (`pinned_pages` in ESCA).

```c
n_page = get_user_pages(
    (p1),                    /* Start address to map */
    MAX_THREAD_NUM,          /* Number of pinned pages. 4096 bytes in this
                                machine */
    FOLL_FORCE | FOLL_WRITE, /* Force flag */
    pinned_pages,            /* struct page ** pointer to pinned pages */
    NULL);

for (i = 0; i < MAX_THREAD_NUM; i++)
    batch_table[i] = (struct batch_entry *) kmap(pinned_pages[i]);
```

### kmap

`kmap()` is intended for relatively short-lived mappings and can map only a single page at a time. It maps a physical page to a page in kernel memory.

## How the behavior of conventional system calls is changed

ESCA uses `LD_PRELOAD` to replace the syscall wrappers provided by glibc with custom syscall wrappers. A custom wrapper checks whether the system call is being made inside a `batch segment`.

1. Outside a `batch segment`: call the syscall wrapper provided by glibc.
2. Inside a `batch segment`: write the syscall number and arguments into the shared table.

### dlsym

The function `dlsym()` takes a "handle" of a dynamic library returned by `dlopen()` and the null-terminated symbol name, returning the address where that symbol is loaded into memory.

Because the system call handlers are dynamically linked to our own implementations at run time, `dlsym` is used here to look up the original versions (the glibc functions) and keep a reference to them.

```c
__attribute__((constructor)) static void setup(void)
{
    pgsize = getpagesize();
    in_segment = 0;
    batch_num = 0;

    /* store glibc function */
    real_writev = real_writev ? real_writev : dlsym(RTLD_NEXT, "writev");
    real_shutdown =
        real_shutdown ? real_shutdown : dlsym(RTLD_NEXT, "shutdown");
    real_sendfile =
        real_sendfile ? real_sendfile : dlsym(RTLD_NEXT, "sendfile");
    real_send = real_send ? real_send : dlsym(RTLD_NEXT, "send");

    global_i = global_j = 0;
}
```

### asmlinkage

When the system call handler calls the corresponding system call routine, it pushes the values of the general-purpose registers onto the stack, so the system call routine has to read the arguments passed by the handler from the stack. That is the purpose of the asmlinkage tag.

The system call handler is assembly code, while the system call routine (for example, sys_nice) is C code. When assembly code calls a C function and passes the parameters on the stack, "asmlinkage" must be added in front of the C function's prototype. With "asmlinkage", the C function takes its parameters from the stack rather than from registers (which could otherwise happen after code optimization). This stack-based convention describes 32-bit x86, where `asmlinkage` expands to `__attribute__((regparm(0)))`. The `sys_register` handler shown earlier instead receives a `struct pt_regs *` and reads its argument from `regs->di`.

## The event-driven framework in lighttpd

### epoll_wait

epoll is an I/O multiplexing model. Its implementation uses two queues. The interest queue holds the file descriptors that epoll is monitoring and that are still waiting for an I/O event to become ready. The ready queue holds the file descriptors whose I/O events are already ready, so I/O can be performed on them.

In practice, `epoll_create` first creates an epoll instance. `epoll_ctl` is then called with `EPOLL_CTL_ADD`, the operation that attaches the newly created socket listener to epoll's interest queue so that epoll monitors its I/O events.

`epoll_wait` waits for I/O events and reports the ones that are ready. Its second parameter is an event buffer, and the data for the ready I/O events is stored there. Because every event has its handler registered before it is added to `epoll`, the event handlers of the ready I/O events are called once `epoll_wait` returns.

```c
static int fdevent_linux_sysepoll_poll(fdevents * const ev, int timeout_ms)
{
    /* wait for ready events; n is the number of events written to the buffer */
    int n = epoll_wait(ev->epoll_fd, ev->epoll_events, ev->maxfds, timeout_ms);

    for (int i = 0; i < n; ++i) {
        fdnode * const fdn = (fdnode *)ev->epoll_events[i].data.ptr;
        int revents = ev->epoll_events[i].events;

        /* dispatch to the handler registered for this fd */
        if ((fdevent_handler)NULL != fdn->handler) {
            (*fdn->handler)(fdn->ctx, revents);
        }
    }
    return n;
}
```

### Observing system calls in lighttpd with strace

From the system calls observed in lighttpd, the flow of a client connecting to lighttpd can be inferred as follows:

1. `epoll_wait` waits for sockets whose connection has completed.
2. `epoll` returns the sockets whose connection has completed; these connections are then accepted and a new socket fd is created for each.
3. `read` tries to read the request from the socket, but the request is not ready yet (`EAGAIN`), so this I/O event is added to `epoll` for monitoring.
4. Once the request has been fully sent, `read` reads the request from the socket.
5. The request asks for a web page, so `lseek` moves the read offset to the position of the web page data in the file and `read` reads it.
6. `writev` writes the web page to the socket to send it to the client.

```
...
4005.609 (577.243 ms): lighttpd/14597 epoll_wait(epfd: 4<anon_inode:[eventpoll]>, events: 0x55e5030116d0, maxevents: 1025, timeout: 1000) = 1
...
4583.125 ( 0.002 ms): lighttpd/14597 accept4(fd: 3<socket:[631034]>, upeer_sockaddr: 0x7ffc3059b420, upeer_addrlen: 0x7ffc3059b3c4, flags: 526336) = 6
4583.129 ( 0.001 ms): lighttpd/14597 read(fd: 6<socket:[631037]>, buf: 0x55e503048420, count: 8191) = -1 EAGAIN (Resource temporarily unavailable)
4583.131 ( 0.001 ms): lighttpd/14597 epoll_ctl(epfd: 4<anon_inode:[eventpoll]>, op: ADD, fd: 6<socket:[631037]>, event: 0x7ffc3059b47c) = 0
...
10155.837 ( 0.001 ms): lighttpd/14303 read(fd: 6<socket:[636120]>, buf: 0x55a045451630, count: 8191) = 106
10155.846 ( 0.001 ms): lighttpd/14303 lseek(fd: 15</home/qwe661234/ESCA/web/index.html>, whence: SET) = 0
10155.848 ( 0.001 ms): lighttpd/14303 read(fd: 15</home/qwe661234/ESCA/web/index.html>, buf: 0x55a04544e4f9, count: 464) = 464
10155.850 ( 0.005 ms): lighttpd/14303 writev(fd: 6, vec: 0x7ffc27ee0110, vlen: 1) = 681
...
```

### Test commands

* Server
> $ sudo perf trace ./downloads/lighttpd1.4-lighttpd-1.4.58/src/lighttpd -D -f downloads/lighttpd1.4-lighttpd-1.4.58/src/lighttpd.conf
* Client
> $ ab -n 10 -c 10 -k http://127.0.0.1:3000/

## Problem

1. There are 10 batch tables in total. Is each table executed by a different thread?
:::info
No. A single thread iterates over all 10 tables and executes them.
:::
2. Why can we set `global_j` to 0 without calling `system_flush`?
```c
if (global_j == MAX_THREAD_NUM - 1) {
    global_j = 0;
} else {
    global_j++;
}
```
:::info
The experiments are designed so that the number of syscalls batched at one time never exceeds 640.
:::
3. [Purpose of dlsym](https://hackmd.io/qF7daNQgSteV2m_UEQKrNw#dlsym)
:::info
Because `LD_PRELOAD` is used at run time to dynamically link the system call handlers to our own implementations, `dlsym` is used here to find the original versions of the handlers and back them up.
* `LD_PRELOAD` is one of the environment variables that control the behavior of ld.so. It makes ld.so load the shared libraries named in the variable first. Because of how shared library symbols are resolved (a later definition does not override an earlier one), any function can be overridden through `LD_PRELOAD`.
:::

## Run ESCA test

The stress testing tool `wrk` was run against the `lighttpd` project with 4 threads and 50 connections for 5 seconds. The results show that with the system call aggregation mechanism enabled, both the number of requests handled per second and the data transfer rate increased by about 22% (91743.33 to 111941.37 requests/sec, 1.80GB to 2.20GB per second).

### origin lighttpd

```
$ downloads/wrk-master/wrk -c 50 -d 5s -t 4 http://localhost:3000/a20.html
Running 5s test @ http://localhost:3000/a20.html
  4 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   519.95us  143.04us   5.00ms   93.63%
    Req/Sec    23.28k     2.88k   47.51k    93.56%
  467823 requests in 5.10s, 9.20GB read
Requests/sec:  91743.33
Transfer/sec:      1.80GB
```

### lighttpd with ESCA

```
$ downloads/wrk-master/wrk -c 50 -d 5s -t 4 http://localhost:3000/a20.html
Running 5s test @ http://localhost:3000/a20.html
  4 threads and 50 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   306.03us  102.79us   1.17ms   72.10%
    Req/Sec    28.13k     5.24k   39.73k    60.78%
  570920 requests in 5.10s, 11.23GB read
  Socket errors: connect 0, read 8176, write 0, timeout 0
Requests/sec: 111941.37
Transfer/sec:      2.20GB
```

## Reference

* [GitHub: ESCA](https://github.com/eecheng87/ESCA)
* [ESCA thesis](https://eecheng87.github.io/ESCA/main.pdf)
* [TLB](https://ithelp.ithome.com.tw/articles/10269930)
* [Linux system calls explained (analysis of the implementation mechanism)](https://zhuanlan.zhihu.com/p/267353577)
* [System Call notes](https://hackmd.io/@combo-tw/Linux-%E8%AE%80%E6%9B%B8%E6%9C%83/%2F%40a29654068%2FHyD4Lu_Dr)
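
## Appendix: batch segment sketch

The wrapper decision described in this note (call through when outside a `batch segment`, record the call in the shared table when inside one, and run everything at `batch_flush()`) can be illustrated with a small userspace simulation. This is only a sketch under stated assumptions, not ESCA's implementation: no real system call is issued, the table is an ordinary array rather than memory shared with the kernel, and `MAX_ENTRY_NUM`'s value, `BENTRY_BUSY`, `do_real_call`, `wrapped_call`, `next_slot`, and `executed` are hypothetical names or values chosen for illustration. Only `batch_start`, `batch_flush`, `in_segment`, `rstatus`, and `BENTRY_EMPTY` are names that appear in the text.

```c
#include <assert.h>

#define MAX_ENTRY_NUM 64 /* assumed size; the real value is not given here */
#define BENTRY_EMPTY 0   /* name from the text; numeric value assumed */
#define BENTRY_BUSY 1    /* hypothetical status for a filled entry */

struct batch_entry {
    int rstatus;  /* BENTRY_EMPTY or BENTRY_BUSY */
    long sysnum;  /* system call number */
    long args[6]; /* up to six arguments */
};

/* Stand-in for the shared table; zero-initialized, so every entry is empty. */
static struct batch_entry table[MAX_ENTRY_NUM];
static int in_segment; /* 1 between batch_start() and batch_flush() */
static int next_slot;  /* next free entry in the table */
static int executed;   /* how many calls have been "executed" so far */

/* Stand-in for entering the kernel: counts the call and echoes its number. */
static long do_real_call(long sysnum, const long *args)
{
    (void) args;
    executed++;
    return sysnum;
}

static void batch_start(void)
{
    in_segment = 1;
}

/* Execute every recorded call in order, then mark the entries empty again. */
static void batch_flush(void)
{
    in_segment = 0;
    for (int i = 0; i < next_slot; i++) {
        do_real_call(table[i].sysnum, table[i].args);
        table[i].rstatus = BENTRY_EMPTY;
    }
    next_slot = 0;
}

/* Wrapper: call through outside a segment, record the call inside one.
 * Simplification: when the table is full, the call is made directly. */
static long wrapped_call(long sysnum, long a1, long a2, long a3)
{
    if (!in_segment || next_slot == MAX_ENTRY_NUM) {
        long args[6] = {a1, a2, a3, 0, 0, 0};
        return do_real_call(sysnum, args);
    }

    struct batch_entry *e = &table[next_slot++];
    assert(e->rstatus == BENTRY_EMPTY); /* a slot is never overwritten */
    e->sysnum = sysnum;
    e->args[0] = a1;
    e->args[1] = a2;
    e->args[2] = a3;
    e->rstatus = BENTRY_BUSY;
    return 0; /* the real result is not known until the flush */
}
```

Calling `batch_start()`, then `wrapped_call()` twice, leaves `executed` at 0 with two filled entries; `batch_flush()` then raises `executed` to 2 and empties the entries. One consequence visible in the sketch is that a batched call cannot return its real result to the caller at the call site, which is why only calls whose result is not needed immediately are good candidates for batching.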