# Protect the System Call, Protect (most of) the World with BASTION
###### tags: `security` `system call`

[website](https://www.unexploitable.systems/publication/jelesnianskibastion/)

## Abstract
* enforces system call integrity at **runtime**
* 3 contexts
    * (Call Type): which system calls may be called, and how they are called
    * (Control Flow)
    * (Argument Integrity)
* Bastion
    * a compiler
    * a runtime monitor system
* Case study
    * NGINX, SQLite, and vsFTPd
    * result: low performance overhead of 0.60%, 2.01%, and 1.65%, respectively

## Introduction
* Traditional methods:
    * debloating, system call filtering, and system call sandboxing
    * idea: disable unused system calls
    * problem: system calls that the program still needs remain invocable (e.g., through code-reuse attacks)
* Contribution:
    * **Novel system call contexts for system call integrity**
        * 3 system call contexts, namely **Call Type**, **Control Flow**, and **Argument Integrity** contexts
    * **Bastion defense enforcing system call integrity**
        * compiler pass
            * analyzes all system call usage, performs instrumentation, and generates metadata
        * runtime monitor
            * enforces the static and dynamic aspects of each system call context
    * **Security & performance evaluation**
        * NGINX, SQLite, and vsFTPd

## Background
### System Call Usage in Attacks
* 400+ system calls in recent Linux kernels
* Only a few system calls are desired by attackers; these are called **sensitive system calls**

### Current system call protection mechanisms
* Attack surface reduction
    * Debloating techniques
        * Remove unused code
        * static program analysis or dynamic coverage analysis
        * Problem: many sensitive system calls are used for library loading (mmap, mprotect)
    * System call filtering
        * Seccomp: system call filtering framework
        * User needs to define an allowlist/denylist for a given application
        * Can restrict a system call argument
        * Problem: cannot remove sensitive-but-necessary system calls
        * Problem: restricting arguments to a constant value is applied across the entire application scope. **e.g.,
there are two callsites of mprotect, one read-only and the other read-executable, so the policy for the entire application scope must be read-executable**
* Control-flow integrity (CFI)
    * LLVM support (backward)
    * performs analysis to generate an allowed set of targets per callsite, called an equivalence class (EC)
    * Problem: large ECs -> inconvenient; small ECs -> dangerous
    * Problem: runtime overhead, e.g., Intel PT
* Data-flow integrity (DFI)
    * instruments every `load` and `store` instruction
    * Problem: overhead

## Contexts for system call integrity
Q: What's a legitimate use of a system call?
A: two variants: (1) control-flow integrity (2) data integrity

In this paper, three contexts are defined accordingly:
1. Call-type context
2. Control-flow context
3. Argument integrity context

### Call-type context
Only permitted system calls may be called, and only in the permitted manner.
* direct call or indirect call
    * direct call: `int ret = chmod("AAA", S_IWOTH)`
    * indirect call: function pointer
* each system call falls into one of these categories:
    * not-callable
    * directly-callable
    * indirectly-callable
* It is rare for system calls to be called from an indirect call site

### Control-flow context
* Keep the valid paths to all sensitive system calls, and enforce this context at runtime

### Argument integrity context
* A system call argument type is either (1) a direct argument or (2) an extended argument
    * direct argument: e.g., constant, local variable
    * extended argument: e.g., pointer
    * if the argument is a struct, its fields must be checked as well

### Real world code examples
- [ ] Legitimate use of the execve system call in NGINX
```clike
// nginx/src/os/unix/ngx_process.c
static void ngx_execute_proc(ngx_cycle_t *cycle, void *data){
    ngx_exec_ctx_t *ctx = data;
    // Legitimate NGINX usage of execve system call
    if (execve(ctx->path, ctx->argv, ctx->envp) == -1) { ... }
    exit(1);
}

// nginx/src/core/ngx_output_chain.c
ngx_int_t ngx_output_chain(ngx_output_chain_ctx_t *ctx, ngx_chain_t *in){
    ...
    if (in->next == NULL && ngx_output_chain_as_is(ctx, in->buf) ) {
        return ctx->output_filter(ctx->filter_ctx, in);
    }
    ...
}
```
* attacker sets `ctx->output_filter = ngx_execute_proc`
* attacker sets `ctx->path = "/bin/sh"`

BASTION:
* Call-type context: static analysis shows that `execve` must be called directly, but at runtime it is reached through an indirect call -> detected
* Control-Flow context: detect
* Argument integrity context: detect

- [ ] Snippet of NGINX code that can be compromised to reach and call the mprotect system call elsewhere by corrupting index in vulnerable code pointer v[index].get_handler()
```clike
// nginx/src/http/ngx_http_variables.c
ngx_http_variable_value_t *ngx_http_get_indexed_variable(
        ngx_http_request_t *r, ngx_uint_t index){
    ...
    if (v[index].get_handler(r, &r->variables[index], v[index].data) == NGX_OK) {
        ngx_http_variable_depth++;
        if (v[index].flags & NGX_HTTP_VAR_NOCACHEABLE) {
            r->variables[index].no_cacheable = 1;
        }
        return &r->variables[index];
    }
    ...
    return NULL;
}
```
* buffer overflow: modify `index` such that `v[index].get_handler = mprotect`
* `r` = memory region to exploit
* change permission

BASTION:
* `mprotect` has no legitimate indirect call -> Call-type context violated
* the control-flow path is invalid
* the arguments are wrong

## Threat model and assumptions
* arbitrary memory read/write
* Data Execution Prevention (DEP)
    * attackers cannot inject or modify code due to DEP
* Address Space Layout Randomization (ASLR)
* Shadow Stack (CET)
* Hardware and OS kernel are trusted
    * Attacks targeting the OS kernel and hardware (e.g., Spectre) are out of scope
* BASTION protects a subset of available system calls
* ![](https://i.imgur.com/kccnj7E.png)

## BASTION design and implementation
* Choose 20 sensitive system calls (Table 1)
* BASTION = BASTION-compiler pass + BASTION-monitor

### BASTION compiler
* Generates context metadata
* Leverages a light-weight library API for dynamic tracking of sensitive data (Argument integrity)

#### Analysis for Call-Type Context
* System calls in the program fall into three categories
    * not-callable
    * directly-callable
    * indirectly-callable
* BASTION analyzes the entire program’s LLVM IR instructions and checks all call instructions.
    * directly-callable: a system call is a target of a direct function call
    * indirectly-callable: the address of a system call is taken and used in the left-hand-side of an assignment
    * not-callable: all remaining system calls
* Metadata:
    * pairs of system call numbers and their call type
    * list of legitimate indirect callsites (i.e., offset in a program binary)

#### Analysis for Control-Flow Context
* performed only when a system call is invoked -> reduces overhead (CFI, in contrast, is enforced at every indirect control-flow transfer)

compile time:
1. generate the CFG
2. identify all function callee→caller relationships that reach system call callsites
3. for each callsite, recursively record all callee→caller associations
4. stop once reaching either main() or an indirect function call

runtime:
1. unwind the stack frames
2. verify callee→caller relations until the bottom of the stack (i.e., main) or an indirect callsite

#### Analysis for Argument Integrity Context
* checks not only system call arguments but also the arguments’ data-dependent variables -> sensitive variables
* maintains a shadow copy of each sensitive variable’s legitimate value in a shadow memory region
* updates the shadow copy whenever the sensitive variable is updated legitimately
* binds each argument to a certain position for the system call so the Bastion runtime monitor can check argument integrity
* ![](https://i.imgur.com/OsN9AVp.png)

![](https://i.imgur.com/En8IjUd.png)

1. enumerate all variables used in system call arguments
2. perform a backward data-flow analysis, traversing the use-def chains to derive any other variables used to define sensitive variables
    * newly identified data-dependent variables are added to the set of sensitive variables
3. if there is a write to a field of a struct (e.g., the size field of gshm in Figure 2), the written field is added to the sensitive variables

**repeat 2, 3 until no new sensitive variables are found**

* Once all sensitive variables are identified, Bastion instruments `ctx_write_mem` after any memory-backed sensitive variable store to keep its shadow copy up-to-date
* Before each sensitive system call callsite, Bastion instruments `ctx_bind_mem_X` or `ctx_bind_const_X` to bind each argument to its respective argument position X

### Bastion runtime monitor
#### Initializing the Bastion Monitor
* Loading metadata:
    * The monitor retrieves ELF, DWARF, and linked library file information to recover symbol addresses
    * loads Bastion context metadata into the monitor’s memory
* Launching a Bastion-protected application:
    * performs fork to spawn a child process where the child runs the Bastion-enabled application
    * initializes a shadow memory region under a segmentation register
    * initializes seccomp: trap on sensitive system calls in the child process
    * ptrace: access the application’s state
* Trapping a system call invocation:
    * custom seccomp-BPF filter to trap on the application’s sensitive system calls
        * SECCOMP_RET_ALLOW: non-sensitive system calls, ignored by the monitor
        * SECCOMP_RET_KILL: disables any not-callable system calls
        * SECCOMP_RET_TRACE: directly- and indirectly-callable system calls, so these system calls can be verified by the Bastion monitor

#### Enforcing Call-Type Context
* Take $PC, look up the metadata, check the call type

#### Enforcing Control-Flow Context
* stack trace: unwinds and gets each function callsite offset
* CFG metadata: a list of callees and their respective valid callers
* continues until the entire stack has been vetted or an indirect call is encountered

#### Enforcing Argument Integrity Context
* verifies integrity of all sensitive variables in the current call stack
* Take $PC, check the associated argument integrity context metadata

### Implementation
* Linux x86-64 v5.19.14
* LLVM
Module
* hardware-based shadow stack `-fcf-protection=full`
* Intel Tiger Lake and AMD Ryzen 7 processors
* Glibc v2.28+
* Binutils v2.29+

Efforts:
* LLVM module: 3,939 lines of code
* Bastion’s C runtime library: 659 lines of code
* Bastion runtime monitor (a C program): 7,313 lines of code

## EVALUATION
### Evaluation Methodology
* 8-core (16-hardware thread) machine featuring an AMD Ryzen 7 PRO 5850U processor and 16 GB DDR4 memory
* Bastion LLVM compiler
* Results are reported as the average over five runs
* **NGINX, SQLite, and vsftpd**

### Performance Evaluation
* NGINX:
    * wrk, an HTTP benchmarking tool
        * sends concurrent HTTP requests
        * measures throughput
    * NGINX configured with a maximum of 1,024 connections per processor
    * 32 worker threads
    * never incurred more than 0.60% degradation compared to the unprotected NGINX baseline
    * Argument Integrity context adds the most overhead
    * uses many sensitive system calls (e.g., mprotect, mmap) during its initialization phase but seldom when idle or processing requests -> Bastion is rarely triggered at runtime
    * average call-depth is only 5.2 frames, with 4 and 9 being the minimum and maximum stack call-depths
* SQLite:
    * DBT2, a database transaction processing benchmark
    * mix of read and write SQL operations for large data warehouse transactions
    * 10 second new thread delay and a 10 minute workload duration
    * number of new-order transactions per minute (NOTPM) as the performance metric
    * Overhead:
        * Call-Type: 0.92%
        * Control-Flow: 1.48%
        * Argument Integrity: 2.01%
* vsftpd:
    * dkftpbench, an FTP benchmark program
    * fetches a 100 MB file from vsftpd, launching clients one after another for a 120 second duration
    * Overhead: worst case 1.65%

![](https://i.imgur.com/NkERhKZ.png)

* Argument Integrity context is the most costly
* LLVM CFI is expensive because it is triggered at every indirect callsite => NGINX does not have many indirect calls

![](https://i.imgur.com/YJVBSzl.png)

* mprotect, mmap

![](https://i.imgur.com/O2SOXJf.png)

* Sensitive system calls
are never called indirectly via a function pointer
* Compare with CET, LLVM CFI
    * CET: maintains a secondary (shadow) stack. Upon returning from a function, the CPU compares the return addresses in the shadow stack and the normal stack
    * LLVM CFI: verification at every indirect callsite

## SECURITY EVALUATION
### ROP Attacks
* libc library call `system` = `fork` + `execl`
* exec-type system calls, to gain access to a root shell
* `mprotect` or `chmod` system calls to change memory or file permissions to be executable

### Direct Attacker Manipulation of System Calls
* Go after system calls directly, setting up callsites and arguments with the desired values
* The **CsCFI** attack leverages mprotect to make the entire libc readable, writable, and executable, revealing the code layout to perform arbitrary code execution
* **AOCR’s Attack 1** uses open and write to reveal the code layout of NGINX in order to execute arbitrary code
* Bastion’s Control-Flow and Argument Integrity contexts detect both attacks
* LLVM CFI cannot defend against either attack.
    * In the CsCFI attack, mprotect is never used, yet its address is still taken because this system call is necessary to support dynamic loading of shared libraries

### Indirect Attacker Manipulation of System Calls
* full-function code re-use, data-oriented attacks, and COOP
* The **NEWTON CPI attack** avoids corrupting any code or data pointers.
It corrupts the index variable of an array of function pointers to make the array index point to a system call location
* Call-Type context blocks the invocation of a system call never used in the program code base

## DISCUSSION AND LIMITATIONS
### Bastion under Arbitrary Memory Corruption
* To bypass all three of Bastion’s contexts, the attacker would realistically need to perform arbitrary reads/writes many times to match the expected context values without violating static constraints

### Protecting Filesystem Related System Calls
* The main challenge is that this type of system call is called much more frequently
* Full Bastion context checking incurs high overhead, e.g., 96.7% for NGINX
* A majority of the overhead results from fetching protected process state using `ptrace` (< 95.7%, delta between Rows 1 and 2)
    * additional context switching overhead to access the protected program
* One way to eliminate the ptrace overhead would be to run the Bastion monitor inside the kernel
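
As a recap of the runtime enforcement described in the Bastion runtime monitor section, here is a minimal sketch in C of the Call-Type check and the Control-Flow stack walk. It is an illustration under stated assumptions, not Bastion's actual code: the `edge` and `frame` structs, the `call_type_ok` and `control_flow_ok` helpers, the flag values, and the function names in the sample metadata are all hypothetical. The sketch also assumes a system call may be both directly- and indirectly-callable, which is why the call type is modeled as bit flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical call-type flags; CALL_NONE models a not-callable system call. */
enum { CALL_NONE = 0, CALL_DIRECT = 1 << 0, CALL_INDIRECT = 1 << 1 };

/* Hypothetical metadata entry: one valid callee -> caller relation. */
struct edge {
    const char *callee;
    const char *caller;
};

/* One unwound stack frame, innermost first. `indirect` marks a frame that
 * was entered through an indirect call, where the static chain ends. */
struct frame {
    const char *func;
    bool indirect;
};

/* Call-Type check: the way the system call was invoked must be permitted. */
static bool call_type_ok(unsigned allowed, bool invoked_indirectly)
{
    unsigned needed = invoked_indirectly ? CALL_INDIRECT : CALL_DIRECT;
    return (allowed & needed) != 0;
}

/* Linear search of the compile-time callee -> caller metadata. */
static bool edge_recorded(const struct edge *edges, size_t n_edges,
                          const char *callee, const char *caller)
{
    for (size_t i = 0; i < n_edges; i++) {
        if (strcmp(edges[i].callee, callee) == 0 &&
            strcmp(edges[i].caller, caller) == 0)
            return true;
    }
    return false;
}

/* Control-Flow check: vet each callee -> caller pair from the system call
 * outward, stopping at an indirect callsite or at the bottom of the stack. */
static bool control_flow_ok(const struct frame *frames, size_t n,
                            const struct edge *edges, size_t n_edges)
{
    assert(frames != NULL && n > 0);
    for (size_t i = 0; i + 1 < n; i++) {
        if (frames[i].indirect)
            return true; /* statically known chain ends here */
        if (!edge_recorded(edges, n_edges, frames[i].func, frames[i + 1].func))
            return false; /* caller was never recorded for this callee */
    }
    return true; /* entire stack vetted */
}

/* Hypothetical metadata: main -> handle_request -> run_helper -> execve. */
static const struct edge demo_edges[] = {
    {"execve", "run_helper"},
    {"run_helper", "handle_request"},
    {"handle_request", "main"},
};

/* Sample stacks (innermost frame first), all function names made up. */
static const struct frame legit_stack[] = {
    {"execve", false}, {"run_helper", false},
    {"handle_request", false}, {"main", false},
};
static const struct frame hijacked_stack[] = {
    {"execve", false}, {"parse_header", false},
    {"handle_request", false}, {"main", false},
};
static const struct frame indirect_stack[] = {
    {"execve", false}, {"run_helper", true},
    {"dispatch", false}, {"main", false},
};
```

With this sample data, `control_flow_ok(legit_stack, 4, demo_edges, 3)` accepts the stack, while `control_flow_ok(hijacked_stack, 4, demo_edges, 3)` rejects it because `parse_header` was never recorded as a caller of `execve`. For `indirect_stack` the walk stops at the frame entered through an indirect call, matching the stopping rule in the notes. A real monitor would compare callsite offsets recovered through `ptrace` rather than function names.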