<style>
.reveal {
font-size: 26px;
}
</style>
# Linux namespaces or building blocks of container runtimes
* slide: https://hackmd.io/@tvannahl/namespaces
* based upon: [Understanding user namespaces](http://man7.org/conf/meetup/understanding-user-namespaces--Google-Munich-Kerrisk-2019-10-25.pdf) *by Michael Kerrisk*
* [Article](https://www.heise.de/hintergrund/Kubernetes-Security-Teil-1-Von-Linux-geerbte-Konzepte-4703935.html) (*in German*, by Thomas Fricke) with a network namespaces demo
---
## What is a virtual machine
These days:
* Memory protection through CPU and mainboard features like VT-x, VT-d, …
* Software-emulated hardware
---
## Linux set of containment tools
Memory protection through the Linux kernel and the CPU ring architecture; hardware is rarely emulated.
* cgroups
* capabilities
* namespaces
* seccomp
---
## (Traditional) superuser and set-UID-`root`
* UNIX privilege model
  - **normal user** vs **root** (UID `0`)
* set-UID-root
---
## Background: capabilities `CAP*`
* Divide superuser privileges into smaller pieces
  + 38 capabilities at the time of writing (`capabilities(7)`)
  + Examples:
    - `CAP_SYS_ADMIN`
    - `CAP_SYS_TIME`
    - `CAP_DAC_OVERRIDE`
* Use `setcap(8)` instead of set-UID-root.
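
To make the "smaller pieces" concrete, here is a minimal Python sketch that decodes a capability bitmask such as the `CapEff` field of `/proc/<pid>/status`. The names `CAP_NAMES`, `decode_caps` and `effective_caps` are made up for this slide; the bit numbers follow `<linux/capability.h>`, and the table stops at bit 37, so a newer kernel may report bits it cannot name.

```python
# Capability names indexed by bit number, following <linux/capability.h>.
# Assumption: the table ends at bit 37 (38 entries); later kernels add more.
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID", "CAP_SETPCAP",
    "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST",
    "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK", "CAP_IPC_OWNER",
    "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT", "CAP_SYS_PTRACE",
    "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT", "CAP_SYS_NICE",
    "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG", "CAP_MKNOD",
    "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL", "CAP_SETFCAP",
    "CAP_MAC_OVERRIDE", "CAP_MAC_ADMIN", "CAP_SYSLOG", "CAP_WAKE_ALARM",
    "CAP_BLOCK_SUSPEND", "CAP_AUDIT_READ",
]


def decode_caps(mask):
    """Return the names of all capabilities whose bit is set in mask."""
    names = []
    bit = 0
    while mask >> bit:
        if (mask >> bit) & 1:
            # Bits beyond the table are reported by number instead of name.
            known = bit < len(CAP_NAMES)
            names.append(CAP_NAMES[bit] if known else f"unknown bit {bit}")
        bit += 1
    return names


def effective_caps(pid="self"):
    """Decode the effective capability set of a process (Linux only)."""
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("CapEff:"):
                return decode_caps(int(line.split()[1], 16))
    return []
```

For an ordinary user shell, `effective_caps()` usually returns an empty list, while a root shell typically shows the whole table.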
<!-- * See `systemd.exec(5)` section **Capabilities** -->
---
## Background: capabilities
* [Docker](https://docs.docker.com/engine/reference/run/#runtime-privilege-and-linux-capabilities)
* `--cap-add`, `--cap-drop`
* `--privileged` (grants all capabilities, including `CAP_SYS_ADMIN`)
* Part of the default set: `CAP_DAC_OVERRIDE`
---
## Background: capabilities
In [Kubernetes](https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#set-capabilities-for-a-container) via [securityContext](https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.18/#securitycontext-v1-core)
```yaml
apiVersion: v1
kind: Pod
metadata: {…}
spec:
  containers:
  - name: …
    image: …
    securityContext:
      privileged: false
      capabilities:
        drop: ["NET_RAW", "DAC_OVERRIDE"]
```
---
## Namespaces
* A namespace (NS) is an isolation tool for global system resources
* The interface started with **mount** namespace isolation in 2002
* Extension of the UNIX approach with `chroot(2)`
* Often compared to FreeBSD Jails
---
## Available namespaces
1. Mount NS (2002)
2. UTS NS (2006)
3. IPC NS (2006)
4. PID NS (2008)
5. Network NS (2009)
6. User NS (2013)
7. Cgroup NS (2016)
8. Time NS (2020)
---
## APIs and commands
**system calls**:
* `clone(2)`
* `unshare(2)`
* `setns(2)`

**shell commands**:
* `unshare(1)`
* `nsenter(1)`
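
All of these interfaces select namespace types through `CLONE_NEW*` flags. The sketch below shows how several types combine into the single integer handed to `clone(2)` or `unshare(2)`. `CLONE_FLAGS` and `flags_for` are hypothetical names for illustration; the constants are copied from `<sched.h>` and keyed by the entry names found under `/proc/<pid>/ns`.

```python
# CLONE_NEW* constants from <sched.h>, keyed by the /proc/<pid>/ns entry name.
CLONE_FLAGS = {
    "mnt": 0x00020000,     # CLONE_NEWNS
    "uts": 0x04000000,     # CLONE_NEWUTS
    "ipc": 0x08000000,     # CLONE_NEWIPC
    "pid": 0x20000000,     # CLONE_NEWPID
    "net": 0x40000000,     # CLONE_NEWNET
    "user": 0x10000000,    # CLONE_NEWUSER
    "cgroup": 0x02000000,  # CLONE_NEWCGROUP
    "time": 0x00000080,    # CLONE_NEWTIME
}


def flags_for(*namespaces):
    """OR together the flags for the requested namespace types."""
    flags = 0
    for ns in namespaces:
        flags |= CLONE_FLAGS[ns]
    return flags
```

`flags_for("user", "uts")` is roughly what `unshare --user --uts` asks the kernel for.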
---
## Visualize via `/proc`
```
# ls -l /proc/1/ns
total 0
lrwxrwxrwx. 1 root root 0 Apr 6 13:44 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx. 1 root root 0 Apr 6 13:44 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx. 1 root root 0 Apr 6 13:44 mnt -> 'mnt:[4026531840]'
lrwxrwxrwx. 1 root root 0 Apr 6 13:44 net -> 'net:[4026531992]'
lrwxrwxrwx. 1 root root 0 Apr 6 13:44 pid -> 'pid:[4026531836]'
lrwxrwxrwx. 1 root root 0 Apr 6 13:44 pid_for_children -> 'pid:[4026531836]'
lrwxrwxrwx. 1 root root 0 Apr 6 13:44 user -> 'user:[4026531837]'
lrwxrwxrwx. 1 root root 0 Apr 6 13:44 uts -> 'uts:[4026531838]'
# ls -l /proc/$$/ns
total 0
lrwxrwxrwx. 1 root root 0 Apr 6 13:44 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx. 1 root root 0 Apr 6 13:44 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx. 1 root root 0 Apr 6 13:44 mnt -> 'mnt:[4026531840]'
lrwxrwxrwx. 1 root root 0 Apr 6 13:44 net -> 'net:[4026531992]'
lrwxrwxrwx. 1 root root 0 Apr 6 13:44 pid -> 'pid:[4026531836]'
lrwxrwxrwx. 1 root root 0 Apr 6 13:44 pid_for_children -> 'pid:[4026531836]'
lrwxrwxrwx. 1 root root 0 Apr 6 13:44 user -> 'user:[4026531837]'
lrwxrwxrwx. 1 root root 0 Apr 6 13:44 uts -> 'uts:[4026531838]'
```
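
Both listings above show the same inode numbers, meaning the shell and PID 1 share every namespace. The same comparison can be sketched in Python; the helper names `parse_ns_link`, `namespaces_of` and `shared_namespaces` are invented for this slide, and the only assumption is the `type:[inode]` link format shown in the listing.

```python
import os
import re

# Link targets under /proc/<pid>/ns look like "uts:[4026531838]".
NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target):
    """Split a namespace link target into (type, inode number)."""
    match = NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match["kind"], int(match["inode"])


def namespaces_of(pid="self"):
    """Map each entry in /proc/<pid>/ns to its namespace inode (Linux only)."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        result[name] = inode
    return result


def shared_namespaces(a, b):
    """Return the entry names whose inode is identical in both mappings."""
    return sorted(name for name in a.keys() & b.keys() if a[name] == b[name])
```

Calling `shared_namespaces(namespaces_of(1), namespaces_of())` as root would list every entry for the situation shown above; reading `/proc/1/ns` as a normal user is usually denied.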
---
## Example UTS NS
* Name comes from the `struct utsname` argument of `uname(2)`
* The struct name derives from "UNIX Time-Sharing System"
* Isolates the hostname and NIS domain name, as seen by these syscalls:
  + `uname(2)`
  + `sethostname(2)`
  + `setdomainname(2)`
---
## UTS NS Demo
*-- live demo --*
using `unshare(1)` and `nsenter(1)`
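
For readers following along without the live demo, here is a rough Python sketch of what `unshare --user --uts` followed by a hostname change does; the demo itself may have used different options. The function name `hostname_in_new_uts_ns` is hypothetical, `unshare(2)` is reached through `ctypes`, and the call is assumed to fail on kernels or sandboxes that forbid unprivileged user namespaces, in which case the function returns `None`.

```python
import ctypes
import os
import socket

CLONE_NEWUTS = 0x04000000   # from <sched.h>
CLONE_NEWUSER = 0x10000000  # from <sched.h>


def hostname_in_new_uts_ns(new_name):
    """Rename the host inside fresh user+UTS namespaces, in a child process.

    Returns the hostname the child saw, or None if the kernel refused.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:  # child: the only process that leaves the original namespaces
        os.close(read_fd)
        seen = b""
        try:
            if libc.unshare(CLONE_NEWUSER | CLONE_NEWUTS) == 0:
                socket.sethostname(new_name)  # only affects the new UTS NS
                seen = socket.gethostname().encode()
        except OSError:
            seen = b""
        finally:
            os.write(write_fd, seen)
            os._exit(0)
    # parent: still in the original UTS namespace
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe:
        seen = pipe.read()
    os.waitpid(pid, 0)
    return seen.decode() or None
```

After `hostname_in_new_uts_ns("demo-ns")` returns, `socket.gethostname()` in the calling process still reports the original name, which is the isolation the UTS NS provides.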
---
## Security issues in the past
* Some parts of the kernel code had not been converted to be aware of namespaces and capabilities
* An unprivileged user gains a capability in a child NS and uses it in the **outer** NS
* The user NS implementation changed a lot of code
---
## Noteworthy high level tools
* Linux containers such as Docker, Podman, LXC
* [Chrome-style sandboxes](https://chromium.googlesource.com/chromium/src/+/master/docs/design/sandbox.md)
* [Firejail](https://firejail.wordpress.com)
* Flatpak
---
## Further reading
* [OpenBSD's `unveil(2)`](https://lwn.net/Articles/767137/)
* [Seccomp BPF](https://dri.freedesktop.org/docs/drm/userspace-api/seccomp_filter.html)
* FreeBSD Jails
---
## *Bonus*: Namespaces in systemd
* Flags in `systemd.unit(5)`:
  - `JoinsNamespaceOf=`
* Flags in `systemd.exec(5)` (section **Sandboxing**):
  - `ProtectSystem=`, `ProtectHome=`
  - `ReadOnlyPaths=`, `InaccessiblePaths=`
  - `NetworkNamespacePath=`
  - `PrivateTmp=`, `PrivateDevices=`, `PrivateNetwork=`
  - `RestrictNamespaces=`
  - `ProtectHostname=`