# State of `seccomp` sandboxing
Issue: https://github.com/paritytech/polkadot/issues/4718
## Summary
With pay-as-you-go paras coming up, it will be easier to submit PVFs for
preparation and execution. As PVFs are untrusted code, it is critically
important that we have some sandboxing around PVF jobs to prevent disputes and
other attacks on validators. (See "Possible attacks" section.)
We are starting with `seccomp` sandboxing, which allows us to filter syscalls
before they get executed by the kernel. When a syscall violates the filter, we
will initially just log it. Once we have enough confidence in the filter (after
sufficient testing and fuzzing), we will enable the full sandbox, where a
violation will kill the process outright.

Killing the process leads to either the PVF not passing preparation (which
should get caught in pre-checking), or, if the violation happens during
execution, a vote against the candidate.
## Current proposal
NOTE: This is my view of how the project should proceed and I may be off or
missing something. As more research is done our design/requirements may change.
- [ ] Worker binaries
  - [ ] Separate the worker binaries.
  - [ ] Build them with musl.
  - [ ] Embed the binaries into `polkadot`.
  - [ ] Extract them onto disk before executing.
- [ ] @koute's syscall detection script.
  - [ ] Run at build time.
  - [ ] Run in CI?
  - [ ] Still need to blacklist some of the detected syscalls.
    - See "Observed syscalls" section.
- [ ] Log syscall violations
  - [ ] Extract from audit log and syslog
  - [ ] Send over telemetry, along with musl version?
  - [ ] How to get the log file location? (I will read kernel source.)
- [ ] Testing
  - [ ] Fuzzing
  - [ ] ?
- [ ] Eventually enable full sandboxing.
  - [ ] Kill the process on seccomp violations.
    - (See "Action on violations" section).
- [ ] Initially just for Linux x86-64.
  - Already includes almost all validators.
- [ ] `--insecure-mode` flag for untested/unsupported platforms.
  - [ ] Links to docs with explanation.
- [ ] `--sandbox-mode` flag for controlling the level of sandboxing?
  - [ ] By default, highest level of security supported on current platform.
## Possible attacks / threat model
WebAssembly is already sandboxed, but multiple CVEs enabling remote code execution have already been reported. See e.g. these two advisories from [Mar 2023](https://github.com/bytecodealliance/wasmtime/security/advisories/GHSA-ff4p-7xrq-q5r8) and [Jul 2022](https://github.com/bytecodealliance/wasmtime/security/advisories/GHSA-7f6x-jwh5-m9r4). See also [this comment](https://github.com/paritytech/polkadot/issues/5492#issuecomment-1462122032) for more background.
So what are we actually worried about? Things that come to mind:
1. **Consensus faults** - if an attacker can get some source of randomness they
could vote against with 50% chance and cause unresolvable disputes.
1. **Targeted slashes** - An attacker can target certain validators (e.g. some
validators running on vulnerable hardware) and make them vote invalid and get
them slashed.
1. **Mass slashes** - With some source of randomness they can do an untargeted
attack, and this wouldn't require much of an escalation at all. I.e. a baddie
can do significant economic damage by voting against with 1/3 chance, without
even stealing keys or completely replacing the binary.
1. **Stealing keys** - that would be pretty bad. Should not be possible with
sandboxing. We should at least not allow network access via `seccomp` for
example.
1. **Taking control over the validator.** E.g. replacing the `polkadot` binary
with a `polkadot-evil` binary. Again, this should not be possible with
sandboxing.
1. **Intercepting and manipulating packets** - effect very similar to the
above, hard to do without also being able to do 4 or 5.
## Observed syscalls
This is the current list of observed syscalls. We have used two methods so far:
1. Running the binaries with seccomp logging enabled (may underestimate).
2. @koute's [static analysis script](https://gist.github.com/koute/166f82bfee5e27324077891008fca6eb) (may overestimate).
   1. NOTE: This is because the script gets syscalls for the whole process, and
      not just the thread.
To start, we will use the wider list produced by static analysis, with an
additional blacklist (see
<https://github.com/paritytech/polkadot/issues/4718#issuecomment-1511085231>)
that removes syscalls such as `read`, `write`, `socket`, `getrandom`, etc.
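
As a rough illustration of how the starting filter could be derived, the sketch below subtracts a blacklist from the statically detected syscalls. The function name `build_allowlist` and the short input lists in the usage note are made up for the example; the real lists are the ones in this document and in the linked comment.

```rust
use std::collections::BTreeSet;

/// Returns the syscalls the filter would allow: everything detected by static
/// analysis, minus anything on the blacklist.
///
/// `build_allowlist` is a hypothetical helper, not an existing function.
fn build_allowlist<'a>(detected: &[&'a str], blacklist: &[&str]) -> BTreeSet<&'a str> {
    detected
        .iter()
        .copied()
        // Drop every detected syscall that is explicitly blacklisted.
        .filter(|name| !blacklist.contains(name))
        .collect()
}
```

For example, with `detected = ["read", "mmap", "socket", "futex"]` and `blacklist = ["read", "write", "socket", "getrandom"]`, only `futex` and `mmap` remain allowed. Blacklisted names that were never detected (here `write` and `getrandom`) simply have no effect.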
### seccomp logging
Observed syscalls from logs (point (1) above).
#### prepare_worker (Apr 2023)
9 - mmap
11 - munmap
13 - rt_sigaction
14 - rt_sigprocmask
28 - madvise
60 - exit
98 - getrusage
131 - sigaltstack
#### execute_worker (Apr 2023)
11 - munmap
14 - rt_sigprocmask
24 - sched_yield
28 - madvise
60 - exit
131 - sigaltstack
157 - prctl
202 - futex
228 - clock_gettime
### Static analysis
Syscalls detected by [static analysis](https://gist.github.com/koute/166f82bfee5e27324077891008fca6eb) (point (2) above).
#### prepare_worker (Apr 2023)
<details>
0 (read)
1 (write)
2 (open)
3 (close)
4 (stat)
5 (fstat)
7 (poll)
8 (lseek)
9 (mmap)
10 (mprotect)
11 (munmap)
12 (brk)
13 (rt_sigaction)
14 (rt_sigprocmask)
15 (rt_sigreturn)
20 (writev)
24 (sched_yield)
25 (mremap)
28 (madvise)
39 (getpid)
41 (socket)
42 (connect)
45 (recvfrom)
53 (socketpair)
55 (getsockopt)
56 (clone)
60 (exit)
61 (wait4)
62 (kill)
72 (fcntl)
79 (getcwd)
87 (unlink)
89 (readlink)
96 (gettimeofday)
97 (getrlimit)
98 (getrusage)
99 (sysinfo)
110 (getppid)
131 (sigaltstack)
144 (sched_setscheduler)
157 (prctl)
158 (arch_prctl)
200 (tkill)
202 (futex)
203 (sched_setaffinity)
204 (sched_getaffinity)
213 (epoll_create)
218 (set_tid_address)
228 (clock_gettime)
231 (exit_group)
232 (epoll_wait)
233 (epoll_ctl)
262 (newfstatat)
273 (set_robust_list)
281 (epoll_pwait)
284 (eventfd)
290 (eventfd2)
291 (epoll_create1)
302 (prlimit64)
309 (getcpu)
318 (getrandom)
</details>
## Action on violations
Right now we will just log syscalls that violate the filter, but eventually we
will have to actually block them. `seccomp` offers a limited set of actions on
filter violations.
Keep in mind that ultimately we want to vote against the candidate (i.e. return
`InvalidCandidate`).
### Possible actions
- [ ] Kill the process
  - Default action.
  - Would cause `AWD` and kill the whole worker, and eventually vote against.
- [ ] Kill the thread
  - **We can't do this because killing a single thread is unsafe.**
  - From `man`:
    - the use of `SECCOMP_RET_KILL_THREAD` to kill a single thread in a
      multithreaded process is likely to leave the process in a permanently
      inconsistent and possibly corrupt state.
- [ ] Send SIGSYS
  - Thread-directed signal.
  - So would need a signal handler in the working thread.
  - Could let us send a special error before killing the worker.
  - **Can't be handled in a safe way**
    - Or if it can, it's highly nontrivial.
    - See <https://github.com/vorner/signal-hook/issues/141>.
- [ ] Return ENOSYS
  - See "Considered but not done >> Return ENOSYS for blocked syscalls".
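
For reference, the actions above correspond to the filter return values defined in `<linux/seccomp.h>`. The sketch below is only a model of that mapping, not code from the workers: the `SeccompAction` type and the `action_on_violation` helper are hypothetical names, while the numeric constants are the kernel's.

```rust
/// Actions a seccomp filter can return for a syscall (subset relevant here).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SeccompAction {
    /// SECCOMP_RET_KILL_PROCESS: kill the whole process.
    KillProcess,
    /// SECCOMP_RET_KILL_THREAD: kill only the calling thread (unsafe for us).
    KillThread,
    /// SECCOMP_RET_TRAP: send a thread-directed SIGSYS.
    Trap,
    /// SECCOMP_RET_ERRNO: fail the syscall with the given errno (e.g. ENOSYS).
    Errno(u16),
    /// SECCOMP_RET_LOG: log the syscall, then let it run.
    Log,
    /// SECCOMP_RET_ALLOW: let the syscall run.
    Allow,
}

impl SeccompAction {
    /// The 32-bit value a BPF filter returns for this action.
    fn to_ret_value(self) -> u32 {
        match self {
            SeccompAction::KillProcess => 0x8000_0000,
            SeccompAction::KillThread => 0x0000_0000,
            SeccompAction::Trap => 0x0003_0000,
            // The low 16 bits (SECCOMP_RET_DATA) carry the errno.
            SeccompAction::Errno(errno) => 0x0005_0000 | u32::from(errno),
            SeccompAction::Log => 0x7ffc_0000,
            SeccompAction::Allow => 0x7fff_0000,
        }
    }
}

/// Hypothetical helper: log-only while the filter is being tested, kill the
/// process once full sandboxing is enabled.
fn action_on_violation(enforce: bool) -> SeccompAction {
    if enforce {
        SeccompAction::KillProcess
    } else {
        SeccompAction::Log
    }
}
```

The single `enforce` switch is an assumption made to keep the example small; how the mode is actually selected (e.g. through the proposed flags) is still open.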
### Conclusion
The only viable action here is to kill the worker. The job will be retried, and
if the syscall violation persists, we eventually vote against.
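
The retry-then-vote flow can be sketched roughly as follows. This is a simplified model under stated assumptions: `JobOutcome`, `Resolution`, `run_with_retries` and the `MAX_ATTEMPTS` value are all hypothetical and do not reflect the actual retry logic or its limits.

```rust
/// Outcome of a single attempt at running a PVF job in a worker.
#[derive(Debug, PartialEq, Eq)]
enum JobOutcome {
    /// The worker finished the job without being killed.
    Completed,
    /// The worker died, e.g. killed by seccomp on a filter violation.
    WorkerKilled,
}

/// What the host does once it stops retrying.
#[derive(Debug, PartialEq, Eq)]
enum Resolution {
    /// Some attempt completed; the job's own result is used.
    Finished,
    /// Every attempt ended with a dead worker; vote against the candidate.
    VoteAgainst,
}

/// Hypothetical number of attempts; the real value is not specified here.
const MAX_ATTEMPTS: u32 = 2;

/// Runs the job up to `MAX_ATTEMPTS` times. `run_job` receives the attempt
/// index (starting at 0) and reports how that attempt ended.
fn run_with_retries(mut run_job: impl FnMut(u32) -> JobOutcome) -> Resolution {
    for attempt in 0..MAX_ATTEMPTS {
        if run_job(attempt) == JobOutcome::Completed {
            return Resolution::Finished;
        }
        // The worker was killed: start a fresh worker and try again.
    }
    Resolution::VoteAgainst
}
```

A violation that happens on every attempt ends in `Resolution::VoteAgainst`, while a worker death that does not repeat is absorbed by the retry.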
## Considered but not done
### Hooking into `uname` to mock the kernel version
Some libc implementations may use `uname` to determine which syscalls to make.
Mocking it would make all this more deterministic. However, the musl code does
not do this.
### Return ENOSYS for blocked syscalls
The idea here was that there are different "versions" of essentially the same
syscall, e.g. `clone` and `clone3`. If libc hits one that is blocked, it may
still try another version of the blocked syscall.

However, returning ENOSYS would not abort the process when a blocked syscall is
made, so the process would keep running. This could lead to non-deterministic
results. See
<https://github.com/paritytech/polkadot/issues/4718#issuecomment-1505292664>.
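
To make the concern concrete, here is a small model of a libc wrapper that falls back between syscall variants. Everything in it is hypothetical (`OnBlocked`, `Outcome`, `call_with_fallback`); it does not describe what musl actually does, only the kind of behavior the ENOSYS approach would permit.

```rust
/// What the filter does when it sees a blocked syscall.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum OnBlocked {
    KillProcess,
    ReturnEnosys,
}

/// Result of the hypothetical libc wrapper below.
#[derive(Debug, PartialEq, Eq)]
enum Outcome {
    /// The worker was killed on the first blocked syscall.
    Killed,
    /// The wrapper succeeded using this syscall.
    Used(&'static str),
    /// Every variant was rejected with ENOSYS; the worker is still running.
    AllFailed,
}

/// Models a libc wrapper that tries syscall variants in order and falls back
/// to the next one when it gets ENOSYS.
fn call_with_fallback(
    variants: &[&'static str],
    blocked: &[&str],
    policy: OnBlocked,
) -> Outcome {
    for &name in variants {
        if !blocked.contains(&name) {
            return Outcome::Used(name);
        }
        if policy == OnBlocked::KillProcess {
            return Outcome::Killed;
        }
        // ReturnEnosys: the wrapper sees ENOSYS and tries the next variant.
    }
    Outcome::AllFailed
}
```

With `variants = ["clone3", "clone"]` and only `clone3` blocked, the kill policy always ends in `Outcome::Killed`, whereas the ENOSYS policy ends in `Outcome::Used("clone")`. Under ENOSYS the worker keeps running on a path that depends on which variants libc tries and which ones the filter blocks, which is where the non-determinism would come from.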