# State of `seccomp` sandboxing

Issue: https://github.com/paritytech/polkadot/issues/4718

## Summary

With pay-as-you-go paras coming up, it will be easier to submit PVFs for preparation and execution. As PVFs are untrusted code, it is critically important that we have some sandboxing around PVF jobs to prevent disputes and other attacks on validators. (See "Possible attacks" section.)

We are starting with `seccomp` sandboxing, which allows us to filter syscalls before they get executed by the kernel. When a syscall violates the filter, we will initially just log it. Once we have enough confidence in the filter (after sufficient testing and fuzzing), we will enable the full sandbox, where a violation will kill the process outright. This leads to either the PVF not passing preparation (which should get caught in pre-checking) or, if the violation happens during execution, a vote against the candidate.

## Current proposal

NOTE: This is my view of how the project should proceed and I may be off or missing something. As more research is done our design/requirements may change.

- [ ] Worker binaries
  - [ ] Separate the worker binaries.
  - [ ] Build them with musl.
  - [ ] Embed the binaries into `polkadot`.
  - [ ] Extract them onto disk before executing.
- [ ] @koute's syscall detection script.
  - [ ] Run at build time.
  - [ ] Run in CI?
  - [ ] Still need to blacklist some of the detected syscalls.
    - See "Observed syscalls" section.
- [ ] Log syscall violations
  - [ ] Extract from audit log and syslog
  - [ ] Send over telemetry, along with musl version?
  - [ ] How to get the log file location? (I will read kernel source.)
- [ ] Testing
  - [ ] Fuzzing
  - [ ] ?
- [ ] Eventually enable full sandboxing.
  - [ ] Kill the process on seccomp violations.
    - (See "Action on violations" section).
  - [ ] Initially just for Linux x86-64.
    - Already includes almost all validators.
  - [ ] `--insecure-mode` flag for untested/unsupported platforms.
    - [ ] Links to docs with explanation.
  - [ ] `--sandbox-mode` flag for controlling the level of sandboxing?
    - [ ] By default, highest level of security supported on current platform.

## Possible attacks / threat model

WebAssembly is already sandboxed, but multiple CVEs enabling remote code execution have already been reported. See e.g. these two advisories from [Mar 2023](https://github.com/bytecodealliance/wasmtime/security/advisories/GHSA-ff4p-7xrq-q5r8) and [Jul 2022](https://github.com/bytecodealliance/wasmtime/security/advisories/GHSA-7f6x-jwh5-m9r4). See also [this comment](https://github.com/paritytech/polkadot/issues/5492#issuecomment-1462122032) for more background.

So what are we actually worried about? Things that come to mind:

1. **Consensus faults** - If an attacker can get some source of randomness, they could vote against with 50% chance and cause unresolvable disputes.
2. **Targeted slashes** - An attacker can target certain validators (e.g. some validators running on vulnerable hardware), make them vote invalid, and get them slashed.
3. **Mass slashes** - With some source of randomness they can do an untargeted attack, and this wouldn't require much of an escalation at all. I.e. a baddie can do significant economic damage by voting against with 1/3 chance, without even stealing keys or completely replacing the binary.
4. **Stealing keys** - That would be pretty bad. Should not be possible with sandboxing. We should at least not allow network access via `seccomp`, for example.
5. **Taking control over the validator** - E.g. replacing the `polkadot` binary with a `polkadot-evil` binary. Again, this should not be possible with sandboxing.
6. **Intercepting and manipulating packages** - Effect very similar to the above; hard to do without also being able to do 4 or 5.

## Observed syscalls

This is the current list of observed syscalls. We have used two methods so far:

1. Running the binaries with seccomp logging enabled (may underestimate).
2. @koute's [static analysis script](https://gist.github.com/koute/166f82bfee5e27324077891008fca6eb) (may overestimate).
   1. NOTE: This is because the script gets syscalls for the whole process, and not just the thread.

To start, we will use the wider list produced by static analysis, with an additional blacklist (see <https://github.com/paritytech/polkadot/issues/4718#issuecomment-1511085231>), including `read`, `write`, `socket`, `getrandom`, etc.

### seccomp logging

Observed syscalls from logs (point (1) above).

#### prepare_worker (Apr 2023)

- 9 - mmap
- 11 - munmap
- 13 - rt_sigaction
- 14 - rt_sigprocmask
- 28 - madvise
- 60 - exit
- 98 - getrusage
- 131 - sigaltstack

#### execute_worker (Apr 2023)

- 11 - munmap
- 14 - rt_sigprocmask
- 24 - sched_yield
- 28 - madvise
- 60 - exit
- 131 - sigaltstack
- 157 - prctl
- 202 - futex
- 228 - clock_gettime

### Static analysis

Syscalls detected by [static analysis](https://gist.github.com/koute/166f82bfee5e27324077891008fca6eb) (point (2) above).

#### prepare_worker (Apr 2023)

<details>

0 (read), 1 (write), 2 (open), 3 (close), 4 (stat), 5 (fstat), 7 (poll), 8 (lseek), 9 (mmap), 10 (mprotect), 11 (munmap), 12 (brk), 13 (rt_sigaction), 14 (rt_sigprocmask), 15 (rt_sigreturn), 20 (writev), 24 (sched_yield), 25 (mremap), 28 (madvise), 39 (getpid), 41 (socket), 42 (connect), 45 (recvfrom), 53 (socketpair), 55 (getsockopt), 56 (clone), 60 (exit), 61 (wait4), 62 (kill), 72 (fcntl), 79 (getcwd), 87 (unlink), 89 (readlink), 96 (gettimeofday), 97 (getrlimit), 98 (getrusage), 99 (sysinfo), 110 (getppid), 131 (sigaltstack), 144 (sched_setscheduler), 157 (prctl), 158 (arch_prctl), 200 (tkill), 202 (futex), 203 (sched_setaffinity), 204 (sched_getaffinity), 213 (epoll_create), 218 (set_tid_address), 228 (clock_gettime), 231 (exit_group), 232 (epoll_wait), 233 (epoll_ctl), 262 (newfstatat), 273 (set_robust_list), 281 (epoll_pwait), 284 (eventfd), 290 (eventfd2), 291 (epoll_create1), 302 (prlimit64), 309 (getcpu), 318 (getrandom)

</details>

## Action on violations

Right now we will just log syscalls that violate the filter, but
eventually we will have to actually block them. `seccomp` offers a limited set of actions on filter violations. Keep in mind that ultimately we want to vote against the candidate (i.e. return `InvalidCandidate`).

### Possible actions

- [ ] Kill the process
  - Default action.
  - Would cause `AWD` and kill the whole worker, and eventually vote against.
- [ ] Kill the thread
  - **We can't do this because killing a single thread is unsafe.**
  - From `man`:
    - the use of `SECCOMP_RET_KILL_THREAD` to kill a single thread in a multithreaded process is likely to leave the process in a permanently inconsistent and possibly corrupt state.
- [ ] Send SIGSYS
  - Thread-directed signal.
  - So would need a signal handler in the working thread.
  - Could let us send a special error before killing the worker.
  - **Can't be handled in a safe way**
    - Or if it can, it's highly nontrivial.
    - See <https://github.com/vorner/signal-hook/issues/141>.
- [ ] Return ENOSYS
  - See "Considered but not done >> Return ENOSYS"

### Conclusion

The only viable action here is to kill the worker. The job will be retried, and if the syscall violation persists, we eventually vote against.

## Considered but not done

### Hooking into `uname` to mock the kernel version

Some libc implementations may use `uname` to determine which syscalls to make. Mocking it would make all this more deterministic. However, the musl code does not do this.

### Return ENOSYS for blocked syscalls

This idea came up because there are different "versions" of essentially the same syscall, e.g. `clone` and `clone3`. If libc hits one that is blocked, it may still try another version of the blocked syscall. Returning ENOSYS would not abort the process when an unknown syscall is called. This could lead to non-deterministic results. See <https://github.com/paritytech/polkadot/issues/4718#issuecomment-1505292664>.
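
To make the overall policy concrete, here is a minimal Rust sketch that only models the filter decision described in this document: the allow-list is the statically detected set minus the blacklist, and anything outside it is logged at first and later kills the process. No ENOSYS path exists in the model, in line with the conclusion above. This is an illustration under assumptions, not the worker implementation: it does not install a real `seccomp` filter, and the names `Policy`, `Action` and `decide` are hypothetical. The syscall numbers are a small subset of the x86-64 lists above.

```rust
use std::collections::BTreeSet;

/// What the filter does with a syscall. The variants mirror the seccomp
/// return actions SECCOMP_RET_ALLOW, SECCOMP_RET_LOG and
/// SECCOMP_RET_KILL_PROCESS.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    Allow,
    Log,
    KillProcess,
}

/// Hypothetical policy model (not the actual worker code).
pub struct Policy {
    /// Syscall numbers that pass the filter.
    allowed: BTreeSet<u32>,
    /// `false`: log-only rollout phase. `true`: full sandbox.
    enforce: bool,
}

impl Policy {
    /// Build the allow-list as "detected by static analysis" minus the
    /// additional blacklist.
    pub fn new(detected: &[u32], blacklist: &[u32], enforce: bool) -> Self {
        let allowed = detected
            .iter()
            .copied()
            .filter(|nr| !blacklist.contains(nr))
            .collect();
        Policy { allowed, enforce }
    }

    /// Decide what happens when the sandboxed process makes syscall `nr`.
    pub fn decide(&self, nr: u32) -> Action {
        if self.allowed.contains(&nr) {
            Action::Allow
        } else if self.enforce {
            Action::KillProcess
        } else {
            Action::Log
        }
    }
}

// A few x86-64 syscall numbers taken from the lists in this document.
pub const READ: u32 = 0;
pub const MMAP: u32 = 9;
pub const MUNMAP: u32 = 11;
pub const SOCKET: u32 = 41;
pub const FUTEX: u32 = 202;
pub const GETRANDOM: u32 = 318;
```

With `detected = [READ, MMAP, MUNMAP, SOCKET, FUTEX, GETRANDOM]` and `blacklist = [READ, SOCKET, GETRANDOM]`, a policy built with `enforce = false` allows `MMAP` and only logs `SOCKET`, while the same policy with `enforce = true` kills the process on `SOCKET`. Keeping the phase as a single flag means the allow-list itself does not change between the logging and enforcing stages.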